Abstract: We develop necessary and sufficient conditions and a 
novel provably consistent and efficient algorithm for discovering 
topics (latent factors) from observations (documents) that are 
realized from a probabilistic mixture of shared latent factors 
that have certain properties. Our focus is on the class of topic 
models in which each shared latent factor contains a novel word 
that is unique to that factor, a property that has come to be 
known as separability. Our algorithm is based on the key insight 
that the novel words correspond to the extreme points of the 
convex hull formed by the row-vectors of a suitably normalized 
word co-occurrence matrix. We leverage this geometric insight to 
establish polynomial computation and sample complexity bounds 
based on a few isotropic random projections of the rows of the 
normalized word co-occurrence matrix. Our proposed
random-projections-based algorithm is naturally amenable to an
efficient distributed implementation and is attractive for modern
web-scale distributed data mining applications.

Index Terms: Topic Modeling, Separability, Random Projection,
Solid Angle, Necessary and Sufficient Conditions.

I. Introduction 

TOPIC modeling refers to a family of generative models
and associated algorithms for discovering the (latent)
topical structure shared by a large corpus of documents. Topic
models are important for organizing, searching, and making sense
of a large text corpus [?]. In this paper we describe a novel
geometric approach, with provable statistical and computational
efficiency guarantees, for learning the latent topics in
a document collection. This work is the culmination of a series
of recent publications on certain structure-leveraging methods
for topic modeling with provable theoretical guarantees [2]-[?].

We consider a corpus of $M$ documents, indexed by $m = 1, \ldots, M$,
each composed of words from a fixed vocabulary of size $W$. The
distinct words in the vocabulary are indexed by $w = 1, \ldots, W$.
Each document $m$ is viewed as an unordered “bag of words” and is
represented by an empirical $W \times 1$ word-counts vector $X^m$,
where $X_{w,m}$ is the number of times that word $w$ appears in
document $m$ [?], [?]-[?]. The entire document corpus is then
represented by the $W \times M$ matrix $X = [X^1, \ldots, X^M]$.¹
A “topic” is a $W \times 1$ distribution over the vocabulary. A topic
model posits the existence of $K < \min(W, M)$ latent topics that are
shared among all $M$ documents in the corpus. The topics can be
collectively represented by the $K$ columns $\beta^1, \ldots, \beta^K$
of a $W \times K$ column-stochastic “topic matrix” $\beta$. Each
document $m$ is conceptually modeled as being generated independently
of all other documents through a two-step process: 1) first draw a
$K \times 1$ document-specific distribution over topics $\theta^m$
from a prior distribution $\Pr(\alpha)$ on the probability simplex
with some hyper-parameters $\alpha$; 2) then draw $N$ iid words
according to a $W \times 1$ document-specific word distribution over
the vocabulary given by $A^m = \sum_{k=1}^{K} \beta^k \theta_{k,m}$,
which is a convex combination (probabilistic mixture) of the latent
topics. Our goal is to estimate $\beta$ from the matrix of empirical
observations $X$. To appreciate the difficulty of the problem,
consider a typical benchmark dataset such as a news article
collection from the New York Times (NYT) [?] that we use in our
experiments. In this dataset, after suitable pre-processing,
$W = 14{,}943$, $M = 300{,}000$, and, on average, $N = 298$.
Thus, $N \ll W \ll M$, $X$ is very sparse, and $M$ is very large.
Typically, $K \approx 100 \ll \min(W, M)$.

¹ When it is clear from the context, we will use $X_{w,m}$ to
represent either the empirical word-count or, by suitable
column-normalization of $X$, the empirical word-frequency.
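
The two-step generative process described above can be sketched in a few lines. This is only an illustration under stated assumptions: the prior is taken to be a Dirichlet (one of the priors the text mentions), and the function name `sample_corpus`, the toy `beta`, and all parameter values are hypothetical choices made for the example, not the authors' code.

```python
import numpy as np

def sample_corpus(beta, alpha, M, N, rng):
    """Draw a W x M word-count matrix X from a topic model.

    Assumes (for illustration) a Dirichlet(alpha) prior on the mixing weights.
    """
    # Step 1: one K-dimensional mixing-weight vector per document (columns of theta).
    theta = rng.dirichlet(alpha, size=M).T              # K x M
    # Column m of A is the word distribution A^m = sum_k beta^k * theta[k, m].
    A = beta @ theta                                    # W x M, columns sum to 1
    # Step 2: draw N iid words per document and record them as counts.
    X = np.column_stack([rng.multinomial(N, A[:, m]) for m in range(M)])
    return X, theta, A

rng = np.random.default_rng(0)
# Toy column-stochastic topic matrix with W = 3 words and K = 2 topics.
beta = np.array([[0.5, 0.0],
                 [0.0, 0.4],
                 [0.5, 0.6]])
X, theta, A = sample_corpus(beta, alpha=np.array([0.5, 0.5]), M=1000, N=20, rng=rng)
```

Each column of `X` sums to `N`, and dividing a column by `N` gives the empirical word-frequency referred to in the footnote.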

This estimation problem in topic modeling has been extensively
studied. The prevailing approach is to compute the MAP/ML estimate
[?]. The true posterior of $\beta$ given $X$, however, is intractable
to compute and the associated MAP and ML estimation problems are in
fact NP-hard in the general case [?], [?]. This necessitates the use
of sub-optimal methods based on approximations and heuristics such as
Variational-Bayes and MCMC [?], [11]-[13]. While they produce
impressive empirical results on many real-world datasets, guarantees
of asymptotic consistency or efficiency for these approaches are
either weak or non-existent. This makes it difficult to evaluate
model fidelity: failure to produce satisfactory results on new
datasets could be due to the use of approximations and heuristics or
due to model misspecification, which is more fundamental.
Furthermore, these sub-optimal approaches are computationally
intensive for large text corpora [?], [?].

To overcome the hardness of the topic estimation problem
in its full generality, a new approach has emerged to learn the
topic model by imposing additional structure on the model
parameters [?], [?], [?], [?], [?], [?]. This paper focuses
on a key structural property of the topic matrix $\beta$ called
topic separability [?], [?], [?], [15] wherein every latent topic
contains at least one word that is novel to it, i.e., the word is
unique to that topic and is absent from the other topics. This
is, in essence, a property of the support of the latent topic
matrix $\beta$. The topic separability property can be motivated
by the fact that for many real-world datasets, the empirical
topic estimates produced by popular Variational-Bayes and
Gibbs Sampling approaches are approximately separable [?], [?].
Moreover, it has recently been shown that the separability
property will be approximately satisfied with high probability
when the dimension of the vocabulary $W$ scales sufficiently
faster than the number of topics $K$ and $\beta$ is a realization
of a Dirichlet prior that is typically used in practice [?].
Therefore, separability is a natural approximation for most
high-dimensional topic models.

Our approach exploits the following geometric implication
of the key separability structure. If we associate each word
in the vocabulary with a row-vector of a suitably normalized
empirical word co-occurrence matrix, the set of novel words
corresponds to the extreme points of the convex hull formed
by the row-vectors of all words. We leverage this geometric
insight and develop a provably consistent and efficient
algorithm. Informally speaking, we establish the following result:

Theorem 1. If the topic matrix is separable and the mixing
weights satisfy a minimum information-theoretically necessary
technical condition, then our proposed algorithm runs in
polynomial time in $M$, $W$, $N$, $K$, and estimates the topic
matrix consistently as $M \to \infty$ with $N \geq 2$ held fixed.
Moreover, our proposed algorithm can estimate $\beta$ to within
an $\epsilon$ element-wise error with probability at least
$1 - \delta$ if
$M \geq \mathrm{Poly}(W, 1/N, K, \log(1/\delta), 1/\epsilon)$.

The asymptotic setting $M \to \infty$ with $N$ held fixed is
motivated by text corpora in which the number of words in a
single document is small while the number of documents is
large. We note that our algorithm can be applied to any family
of topic models whose topic mixing weights prior $\Pr(\alpha)$
satisfies a minimum information-theoretically necessary technical
condition. In contrast, the standard Bayesian approaches such
as Variational-Bayes or MCMC need to be hand-designed
separately for each specific topic mixing weights prior.

The highlight of our approach is to identify the novel words
as extreme points through appropriately defined random
projections. Specifically, we project the row-vector of each word
in an appropriately normalized word co-occurrence matrix
along a few independent and isotropically distributed random
directions. The fraction of times that a word attains the maximum
value along a random direction is a measure of its degree
of robustness as an extreme point. This process of random
projections followed by counting the number of times a word
is a maximizer can be efficiently computed and is robust to
the perturbations induced by sampling noise associated with
having only a very small number of words per document $N$. In
addition to being computationally efficient, it turns out that this
random projections based approach (1) requires the minimum
information-theoretically necessary technical conditions on
the topic prior for asymptotic consistency, and (2) can be
naturally parallelized and distributed. As a consequence, it
can provably achieve the efficiency guarantees of a centralized
method while requiring insignificant communication between
distributed document collections [?]. This is attractive for
web-scale topic modeling of large distributed text corpora.

Another advance of this paper is the identification of necessary
and sufficient conditions on the mixing weights for
consistent separable topic estimation. In previous work we
showed that a simplicial condition on the mixing weights is
both necessary and sufficient for consistently detecting all the
novel words [?]. In this paper we complete the characterization
by showing that an affine independence condition on the
mixing weights is necessary and sufficient for consistently
estimating a separable topic matrix. These conditions are
satisfied by practical choices of topic priors such as the
Dirichlet distribution [?]. All these necessary conditions are
information-theoretic and algorithm-independent, i.e., they hold
irrespective of the specific statistics of the observations or
the algorithms that are used. The provable statistical and
computational efficiency guarantees of our proposed algorithm
hold true under these necessary and sufficient conditions.

The rest of this paper is organized as follows. We review
related work on topic modeling as well as the separability
property in various domains in Sec. II. We introduce the
separability property on $\beta$, the simplicial and affine
independence conditions on mixing weights, and the extreme point
geometry that motivates our approach in Sec. III. We then discuss how
the solid angle can be used to identify robust extreme points
to deal with a finite number of samples (words per document)
in Sec. IV. We describe our overall algorithm and sketch
its analysis in Sec. V. We demonstrate the performance of
our approach in Sec. VI on various synthetic and real-world
examples. Proofs of all results appear in the appendices.

II. Related Work 

The idea of modeling text documents as mixtures of a few
semantic topics was first proposed in [17] where the mixing
weights were assumed to be deterministic. Latent Dirichlet
Allocation (LDA) in the seminal work of [?] extended this
to a probabilistic setting by modeling topic mixing weights
using Dirichlet priors. This setting has been further extended
to include other topic priors such as the log-normal prior
in the Correlated Topic Model [?]. LDA models and their
derivatives have been successful on a wide range of problems
in terms of achieving good empirical performance [?], [?].

The prevailing approaches for estimation and inference
problems in topic modeling are based on MAP or ML estimation
[?]. However, the computation of posterior distributions
conditioned on observations $X$ is intractable [?]. Moreover, the
MAP estimation objective is non-convex and has been shown
to be NP-hard [?], [?]. Therefore various approximation and
heuristic strategies have been employed. These approaches
fall into two major categories: sampling approaches and
optimization approaches. Most sampling approaches are based
on Markov Chain Monte Carlo (MCMC) algorithms that
seek to generate (approximately) independent samples from
a Markov Chain that is carefully designed to ensure that
the sample distribution converges to the true posterior [?],
[?]. Optimization approaches are typically based on the
so-called Variational-Bayes methods. These methods optimize the
parameters of a simpler parametric distribution so that it is
close to the true posterior in terms of KL divergence [?], [?].
Expectation-Maximization-type algorithms are typically used
in these methods. In practice, while both Variational-Bayes
and MCMC algorithms have similar performance,
Variational-Bayes is typically faster than MCMC [?], [?].

Nonnegative Matrix Factorization (NMF) is an alternative
approach for topic estimation. NMF-based methods exploit
the fact that both the topic matrix $\beta$ and the mixing weights
are nonnegative and attempt to decompose the empirical
observation matrix $X$ into a product of a nonnegative topic
matrix $\beta$ and the matrix of mixing weights by minimizing a
cost function of the form [?]-[?]

$$\sum_{m=1}^{M} d(X^m, \beta\theta^m) + \psi(\beta, \theta),$$

where $d(\cdot,\cdot)$ is some measure of closeness and $\psi$ is a
regularization term which enforces desirable properties, e.g.,
sparsity, on $\beta$ and the mixing weights. The NMF problem,
however, is also known to be non-convex and NP-hard [24] in general.
Sub-optimal strategies such as alternating minimization, greedy
gradient descent, and heuristics are used in practice [22].

In contrast to the above approaches, a new approach has
recently emerged which is based on imposing additional
structure on the model parameters [?], [?], [?], [?], [14],
[?]. These approaches show that the topic discovery problem
lends itself to provably consistent and polynomial-time
solutions by making assumptions about the structure of the
topic matrix $\beta$ and the distribution of the mixing weights. In
this category of approaches are methods based on a tensor
decomposition of the moments of $X$ [?]. The algorithm
in [25] uses second order empirical moments and is shown
to be asymptotically consistent when the topic matrix $\beta$ has
a special sparsity structure. The algorithm in [?] uses the
third order tensor of observations. It is, however, strongly
tied to the specific structure of the Dirichlet prior on the
mixing weights and requires knowledge of the concentration
parameters of the Dirichlet distribution [?]. Furthermore, in
practice these approaches are computationally intensive and
require some initial coarse dimensionality reduction, gradient
descent speedups, and GPU acceleration to process large-scale
text corpora like the NYT dataset [14].

Our work falls into the family of approaches that exploit
the separability property of $\beta$ and its geometric implications
[?], [?], [?], [?], [?], [?], [?]. An asymptotically consistent
polynomial-time topic estimation algorithm was first proposed
in [?]. However, this method requires solving $W$ linear
programs, each with $W$ variables, and is computationally
impractical. Subsequent work improved the computational efficiency
[?], [?], but theoretical guarantees of asymptotic consistency
(when $N$ is fixed and the number of documents $M \to \infty$)
are unclear. Algorithms in [?] and [?] are both practical and
provably consistent. Each requires a stronger and slightly
different technical condition on the topic mixing weights than
[?]. Specifically, [?] imposes a full-rank condition on the
second-order correlation matrix of the mixing weights and
proposes a Gram-Schmidt procedure to identify the extreme
points. Similarly, [?] imposes a diagonal-dominance condition
on the same second-order correlation matrix and proposes
a random projections based approach. These approaches are
tied to the specific conditions imposed and they both fail
to detect all the novel words and estimate topics when the
imposed conditions (which are sufficient but not necessary for
consistent novel word detection or topic estimation) fail to hold
in some examples [?]. The random projections based algorithm
proposed in [?] is both practical and provably consistent.
Furthermore, it requires fewer constraints on the topic mixing
weights.

We note that the separability property has been exploited
in other recent work as well [26], [27]. In [27], a singular
value decomposition based approach is proposed for topic
estimation. In [26], it is shown that the standard
Variational-Bayes approximation can be asymptotically consistent if
$\beta$ is separable. However, the additional constraints proposed
essentially boil down to the requirement that each document
contain predominantly only one topic. In addition to assuming
the existence of such “pure” documents, [26] also requires a
strict initialization. It is thus unclear how this can be achieved
using only the observations $X$.

The separability property has been re-discovered and exploited
in the literature across a number of different fields
and has found application in several problems. To the best of
our knowledge, this concept was first introduced as the Pure
Pixel Index assumption in the Hyperspectral Image unmixing
problem [28]. This work assumes the existence of pixels in
a hyper-spectral image containing predominantly one species.
Separability has also been studied in the NMF literature in the
context of ensuring the uniqueness of NMF [29]. Subsequent
work has led to the development of NMF algorithms that
exploit separability [?], [?]. The uniqueness and correctness
results in this line of work have primarily focused on the
noiseless case. We finally note that separability has also
been recently exploited in the problem of learning multiple
ranking preferences from pairwise comparisons for personal
recommendation systems and information retrieval [?], [?]
and has led to provably consistent and efficient estimation
algorithms.

III. Topic Separability, Necessary and Sufficient 
Conditions, and the Geometric Intuitions 

In this section, we unravel the key ideas that motivate our
algorithmic approach by focusing on the ideal case where there
is no “sampling-noise”, i.e., each document is infinitely long
($N = \infty$). In the next section, we will turn to the finite $N$
case. We recall that $\beta$ and $X$ denote the $W \times K$ topic
matrix and the $W \times M$ empirical word counts/frequency matrix
respectively. Also, $M$, $W$, and $K$ denote, respectively, the
number of documents, the vocabulary size, and the number of topics.
For convenience, we group the document-specific mixing weights,
the $\theta^m$'s, into a $K \times M$ weight matrix
$\theta = [\theta^1, \ldots, \theta^M]$ and the document-specific
distributions, the $A^m$'s, into a $W \times M$ document distribution
matrix $A = [A^1, \ldots, A^M]$. The generative procedure that
describes a topic model then implies that $A = \beta\theta$. In the
ideal case considered in this section ($N = \infty$), the empirical
word frequency matrix $X = A$. Notation: A vector $a$ without
specification will denote a column-vector, $\mathbf{1}$ the all-ones
column vector of suitable size, $X^i$ the $i$-th column vector and
$X_j$ the $j$-th row vector of matrix $X$, and $\bar{B}$ a suitably
row-normalized version (described later) of a nonnegative matrix $B$.
Also, $[n] := \{1, \ldots, n\}$.

A. Key Structural Property: Topic Separability 

We first introduce separability as a key structural property
of a topic matrix $\beta$. Formally,

Definition 1. (Separability) A topic matrix
$\beta \in \mathbb{R}^{W \times K}$ is separable if for each topic $k$,
there is some word $i$ such that $\beta_{i,k} > 0$ and
$\beta_{i,l} = 0$, $\forall\, l \neq k$.

Topic separability implies that each topic contains word(s)
which appear only in that topic. We refer to these words as
the novel words of the $K$ topics. Figure 1 shows an example
of a separable $\beta$ with $K = 3$ topics. Words 1 and 2 are novel
to topic 1, words 3 and 4 to topic 2, and word 5 to topic
3. Other words that appear in multiple topics are called
non-novel words (e.g., word 6). Identifying the novel words for $K$
distinct topics is the key step of our proposed approach.

Fig. 1. An example of separable topic matrix $\beta$ (left) and the
underlying geometric structure (right) of the row space of the
normalized document distribution matrix $\bar{A}$. Note: the word
ordering is only for visualization and has no bearing on
separability. Solid circles represent rows of $\bar{A}$. Empty
circles represent rows of $\bar{X}$ when $N$ is finite (in the ideal
case, $\bar{A} = \bar{X}$). Projections of $\bar{A}_w$'s (resp.
$\bar{X}_w$'s) along a random isotropic direction $d$ can
be used to identify novel words.
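
Definition 1 can be checked mechanically when $\beta$ is known. The sketch below is illustrative only: the helper names `novel_words` and `is_separable` are hypothetical, words and topics are 0-indexed, and the toy matrix merely mimics the support pattern described for Fig. 1 (its numerical values are made up).

```python
import numpy as np

def novel_words(beta, tol=0.0):
    """Map each topic k to the list of words whose only nonzero entry is in column k."""
    support = beta > tol                          # W x K boolean support of beta
    novel = {k: [] for k in range(beta.shape[1])}
    for w in np.flatnonzero(support.sum(axis=1) == 1):
        novel[int(np.argmax(support[w]))].append(int(w))
    return novel

def is_separable(beta, tol=0.0):
    """Separable means every topic has at least one novel word."""
    return all(len(words) > 0 for words in novel_words(beta, tol).values())

# Toy 6 x 3 column-stochastic matrix: words 0,1 novel to topic 0,
# words 2,3 novel to topic 1, word 4 novel to topic 2, word 5 non-novel.
beta = np.array([[0.4, 0.0, 0.0],
                 [0.4, 0.0, 0.0],
                 [0.0, 0.3, 0.0],
                 [0.0, 0.3, 0.0],
                 [0.0, 0.0, 0.5],
                 [0.2, 0.4, 0.5]])
novel = novel_words(beta)
```

In practice $\beta$ is unknown, so this check is useful only for validating synthetic examples; the point of the paper is to find the novel words from $X$ alone.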

We note that separability has been empirically observed
to be approximately satisfied by topic estimates produced by
Variational-Bayes and MCMC based algorithms [?], [?], [26].
More fundamentally, in very recent work [?], it has been
shown that topic separability is an inevitable consequence of
having a relatively small number of topics in a very large
vocabulary (high-dimensionality). In particular, when the $K$
columns (topics) of $\beta$ are independently sampled from a
Dirichlet distribution (on a $(W - 1)$-dimensional probability
simplex), the resulting topic matrix $\beta$ will be (approximately)
separable with probability tending to 1 as $W$ scales to infinity
sufficiently faster than $K$. A Dirichlet prior on $\beta$ is widely
used in smoothed settings of topic modeling [?].

As we will discuss next in Sec. III-C, the topic separability
property combined with additional conditions on the second-order
statistics of the mixing weights leads to an intuitively
appealing geometric property that can be exploited to develop
a provably consistent and efficient topic estimation algorithm.

B. Conditions on the Topic Mixing Weights 

Topic separability alone does not guarantee that there will
be a unique $\beta$ that is consistent with all the observations
$X$. This is illustrated in Fig. 2 [?]. Therefore, in an effort
to develop provably consistent topic estimation algorithms,
a number of different conditions have been imposed on the
topic mixing weights $\theta$ in the literature [?], [?], [?], [?],
[?]. Complementing the work in [?] which identifies necessary and
sufficient conditions for consistent detection of novel words, in
this paper we identify necessary and sufficient conditions for
consistent estimation of a separable topic matrix. Our necessity
results are information-theoretic and algorithm-independent
in nature, meaning that they are independent of any specific
statistics of the observations and the algorithms used. The
novel words and the topics can only be identified up to a
permutation and this is accounted for in our results.

Let $a := \mathbb{E}(\theta^m)$ and
$R := \mathbb{E}(\theta^m \theta^{m\top})$ be the $K \times 1$
expectation vector and the $K \times K$ correlation matrix of the
weight prior $\Pr(\alpha)$. Without loss of generality, we can assume
that the elements of $a$ are strictly positive since otherwise
some topic(s) will not appear in the corpus. A key quantity
is $\bar{R} := \mathrm{diag}(a)^{-1} R \, \mathrm{diag}(a)^{-1}$, which
may be viewed as a “normalized” second-moment matrix of the weight
vector. The following conditions are central to our results.
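
As a concrete instance, the normalized second-moment matrix can be written down in closed form for a Dirichlet prior and compared against a Monte-Carlo estimate. The sketch assumes a Dirichlet prior with a made-up concentration vector `alpha`; it uses the standard Dirichlet moments $\mathbb{E}(\theta_i) = \alpha_i/\alpha_0$ and $\mathbb{E}(\theta\theta^\top) = (\mathrm{diag}(\alpha) + \alpha\alpha^\top)/(\alpha_0(\alpha_0+1))$ with $\alpha_0 = \sum_i \alpha_i$. The function name is hypothetical.

```python
import numpy as np

def normalized_second_moment(a, R):
    """R_bar = diag(a)^{-1} R diag(a)^{-1}, i.e. R_bar[i, j] = R[i, j] / (a[i] * a[j])."""
    return R / np.outer(a, a)

# Closed form for theta ~ Dirichlet(alpha); alpha is an illustrative choice.
alpha = np.array([0.3, 0.5, 1.2])
alpha0 = alpha.sum()
a = alpha / alpha0
R = (np.diag(alpha) + np.outer(alpha, alpha)) / (alpha0 * (alpha0 + 1.0))
R_bar = normalized_second_moment(a, R)

# Monte-Carlo estimate of the same quantity from sampled mixing weights.
rng = np.random.default_rng(0)
theta = rng.dirichlet(alpha, size=200_000)       # each row is one draw theta^m
a_hat = theta.mean(axis=0)
R_hat = theta.T @ theta / theta.shape[0]
R_bar_hat = normalized_second_moment(a_hat, R_hat)
```

For this prior the normalization simplifies to $\bar{R} = \frac{\alpha_0}{\alpha_0 + 1}\,(\mathrm{diag}(\alpha)^{-1} + \mathbf{1}\mathbf{1}^\top)$, a positive-definite matrix, which is consistent with the text's remark that Dirichlet priors satisfy the conditions below.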

Condition 1. (Simplicial Condition) A matrix $B$ is (row-wise)
$\gamma_s$-simplicial if any row-vector of $B$ is at a Euclidean
distance of at least $\gamma_s > 0$ from the convex hull of the
remaining row-vectors. A topic model is $\gamma_s$-simplicial if its
normalized second-moment $\bar{R}$ is $\gamma_s$-simplicial.

Condition 2. (Affine-Independence) A matrix $B$ is (row-wise)
$\gamma_a$-affine-independent if
$\min_{\lambda} \| \sum_{k=1}^{K} \lambda_k B_k \|_2 / \|\lambda\|_2 \geq \gamma_a > 0$,
where $B_k$ is the $k$-th row of $B$ and the minimum
is over all $\lambda \in \mathbb{R}^K$ such that $\lambda \neq 0$ and
$\sum_{k=1}^{K} \lambda_k = 0$. A topic model is
$\gamma_a$-affine-independent if its normalized second-moment
$\bar{R}$ is $\gamma_a$-affine-independent.

Here, $\gamma_s$ and $\gamma_a$ are called the simplicial and
affine-independence constants respectively. They are condition
numbers which measure the degree to which the conditions that
they are respectively associated with hold. The larger these
condition numbers are, the easier it is to estimate the topic
matrix. Going forward, we will say that a matrix is simplicial
(resp. affine independent) if it is $\gamma_s$-simplicial (resp.
$\gamma_a$-affine-independent) for some $\gamma_s > 0$ (resp.
$\gamma_a > 0$). The simplicial condition was first proposed in [9]
and then further investigated in [?]. This paper is the first to
identify affine-independence as both necessary and sufficient for
consistent separable topic estimation. Before we discuss their
geometric implications, we point out that affine-independence is
stronger than the simplicial condition:

Proposition 1. $\bar{R}$ is $\gamma_a$-affine-independent
$\Rightarrow$ $\bar{R}$ is at least $\gamma_a$-simplicial. The
reverse implication is false in general.
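
The affine-independence constant of Condition 2 can be computed numerically. One way, sketched here as an assumption-laden illustration rather than the authors' procedure, is to parametrize the constraint set $\{\lambda : \sum_k \lambda_k = 0\}$ by an orthonormal basis $Q$, so that the constant equals the smallest singular value of $B^\top Q$. The function name and the toy matrices are hypothetical. The second example applies the matrix-level definitions to a generic matrix (not a valid $\bar{R}$): the four corners of a square are each at distance $1/\sqrt{2}$ from the convex hull of the other three, yet they are affinely dependent.

```python
import numpy as np

def affine_independence_constant(B):
    """min ||sum_k lam_k B_k||_2 / ||lam||_2 over lam != 0 with sum(lam) = 0."""
    K, n = B.shape
    # Rows 1..K-1 of Vt form an orthonormal basis of {lam : sum(lam) = 0}.
    _, _, Vt = np.linalg.svd(np.ones((1, K)))
    Q = Vt[1:].T                                   # K x (K-1)
    if n < K - 1:
        return 0.0                                 # B^T Q must have a null vector
    singular_values = np.linalg.svd(B.T @ Q, compute_uv=False)
    return float(singular_values[-1])              # smallest singular value

# Rows of the identity are affinely independent (constant equals 1).
gamma_identity = affine_independence_constant(np.eye(3))

# Corners of a unit square embedded in the plane z = 1: every corner is an
# extreme point, but lam = (1, -1, 1, -1) sums to zero and annihilates the rows.
square = np.array([[0.0, 0.0, 1.0],
                   [1.0, 0.0, 1.0],
                   [1.0, 1.0, 1.0],
                   [0.0, 1.0, 1.0]])
gamma_square = affine_independence_constant(square)
```

Here `gamma_identity` is 1 and `gamma_square` is zero up to floating-point error, which illustrates at the level of the definitions why being simplicial does not imply affine-independence.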

The Simplicial Condition is both Necessary and Sufficient
for Novel Word Detection: We first focus on detecting all
the novel words of the $K$ distinct topics. For this task, the
simplicial condition is an algorithm-independent,
information-theoretic necessary condition. Formally,

Lemma 1. (Simplicial Condition is Necessary for Novel Word
Detection [?, Lemma 1]) Let $\beta$ be separable and $W > K$.
If there exists an algorithm that can consistently identify all
novel words of all $K$ topics from $X$, then $\bar{R}$ is simplicial.

The key insight behind this result is that when $\bar{R}$ is
non-simplicial, we can construct two distinct separable topic
matrices with different sets of novel words which induce the same
distribution on the empirical observations $X$. Geometrically,
the simplicial condition guarantees that the $K$ rows of $\bar{R}$
will be extreme points of the convex hull that they themselves form.
Therefore, if $\bar{R}$ is not simplicial, there will exist at least
one redundant topic which is just a convex combination of the
other topics.

Fig. 2. Example showing that topic separability alone does not
guarantee a unique solution to the problem of estimating $\beta$ from
$X$. Here, $\beta^{(1)}\theta = \beta^{(2)}\theta = A$ is a document
distribution matrix that is consistent with two different topic
matrices $\beta^{(1)}$ and $\beta^{(2)}$ that are both separable.

It turns out that $\bar{R}$ being simplicial is also sufficient for
consistent novel word detection. This is a direct consequence
of the consistency guarantees of our approach as outlined in
Theorem 3.

Affine-Independence is Necessary and Sufficient for Separable
Topic Estimation: We now focus on estimating
a separable topic matrix $\beta$, which is a stronger requirement
than detecting novel words. It naturally requires conditions
that are stronger than the simplicial condition.
Affine-independence turns out to be an algorithm-independent,
information-theoretic necessary condition. Formally,

Lemma 2. (Affine-Independence is Necessary for Separable
Topic Estimation) Let $\beta$ be separable with $W > 2 + K$. If there
exists an algorithm that can consistently estimate $\beta$ from $X$,
then its normalized second-moment $\bar{R}$ is affine-independent.

Similar to Lemma 1, if $\bar{R}$ is not affine-independent, we
can construct two distinct and separable topic matrices that
induce the same distribution on the observations, which makes
consistent topic estimation impossible. Geometrically, every
point in a convex set can be decomposed uniquely as a
convex combination of its extreme points if, and only if,
the extreme points are affine-independent. Hence, if $\bar{R}$ is
not affine-independent, a non-novel word can be assigned to
different subsets of topics.

The sufficiency of the affine-independence condition in
separable topic estimation is again a direct consequence of
the consistency guarantees of our approach as in Theorems 3
and 4. We note that since affine-independence implies the
simplicial condition (Proposition 1), affine-independence is
sufficient for novel word detection as well.

Connection to Other Conditions on the Mixing Weights:
We briefly discuss other conditions on the mixing weights
$\theta$ that have been exploited in the literature. In [?], [13],
$R$ (equivalently $\bar{R}$) is assumed to have full rank (with
minimum eigenvalue $\gamma_r > 0$). In [?], $\bar{R}$ is assumed to be
diagonal-dominant, i.e., $\forall\, i, j$, $i \neq j$,
$\bar{R}_{i,i} - \bar{R}_{i,j} \geq \gamma_d > 0$. They are both
sufficient conditions for detecting all the novel words of all
distinct topics. The constants $\gamma_r$ and $\gamma_d$ are condition
numbers which measure the degree to which the full-rank and
diagonal-dominance conditions hold respectively. They are
counterparts of $\gamma_s$ and $\gamma_a$ and like them, the larger
they are, the easier it is to consistently detect the novel words and
estimate $\beta$. The relationships between these conditions are
summarized in Proposition 2 and illustrated in Fig. 3.



Fig. 3. Relationships between Simplicial, Affine-Independence, Full Rank,
and Diagonal Dominance conditions on the normalized second-moment $\bar{R}$.

Proposition 2. Let $\bar{R}$ be the normalized second-moment of
the topic prior. Then,

1) $\bar{R}$ is full rank with minimum eigenvalue $\gamma_r$
$\Rightarrow$ $\bar{R}$ is at least $\gamma_r$-affine-independent
$\Rightarrow$ $\bar{R}$ is at least $\gamma_r$-simplicial.

2) $\bar{R}$ is $\gamma_d$-diagonal-dominant $\Rightarrow$ $\bar{R}$
is at least $\gamma_d$-simplicial.

3) $\bar{R}$ being diagonal-dominant neither implies nor is implied
by $\bar{R}$ being affine-independent (or full-rank).

We note that in our earlier work [?], the provable guarantees
for estimating the separable topic matrix require $\bar{R}$ to have
full rank. The analysis in this paper provably extends the
guarantees to the affine-independence condition.

C. Geometric Implications and Random Projections Based 
Algorithm 

We now demonstrate the geometric implications of topic
separability combined with the simplicial/affine-independence
condition on the topic mixing weights. To highlight the key
ideas we focus on the ideal case where $N = \infty$. Then, the
empirical document word-frequency matrix $X = A = \beta\theta$.

Novel Words are Extreme Points: To expose the underlying
geometry, we normalize the rows of $A$ and $\theta$ to obtain
row-stochastic matrices $\bar{A} := \mathrm{diag}(A\mathbf{1})^{-1} A$
and $\bar{\theta} := \mathrm{diag}(\theta\mathbf{1})^{-1} \theta$.
Then since $A = \beta\theta$, we have $\bar{A} = \bar{\beta}\bar{\theta}$
where $\bar{\beta} := \mathrm{diag}(A\mathbf{1})^{-1} \beta \,
\mathrm{diag}(\theta\mathbf{1})$ is a row-normalized “topic
matrix” which is both row-stochastic and separable with the
same sets of novel words as $\beta$.
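
The normalization identities above are easy to verify numerically in the ideal (noiseless) case. The following sketch builds a small synthetic separable model, with all sizes, seeds, and the Dirichlet choice for the mixing weights being illustrative assumptions, and checks that $\bar{A} = \bar{\beta}\bar{\theta}$, that $\bar{\beta}$ is row-stochastic, and that rows of $\bar{A}$ for novel words coincide with rows of $\bar{\theta}$.

```python
import numpy as np

def row_normalize(B):
    """Divide each row by its sum so that every row sums to one."""
    return B / B.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
W, K, M = 6, 3, 50
beta = rng.random((W, K))
beta[:K] = np.diag(rng.random(K) + 0.1)        # words 0..K-1: one novel word per topic
beta /= beta.sum(axis=0)                       # make beta column-stochastic
theta = rng.dirichlet(np.ones(K), size=M).T    # K x M mixing weights (assumed Dirichlet)
A = beta @ theta                               # ideal case: X = A

A_bar = row_normalize(A)
theta_bar = row_normalize(theta)
# beta_bar = diag(A 1)^{-1} beta diag(theta 1): scale columns, then rows.
beta_bar = (beta * theta.sum(axis=1)) / A.sum(axis=1, keepdims=True)

identity_holds = np.allclose(A_bar, beta_bar @ theta_bar)
novel_rows_match = np.allclose(A_bar[:K], theta_bar)
```

Both flags evaluate to `True` for this construction; the second one is the fact that the rest of this section builds on.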

Now consider the row vectors of $\bar{A}$ and $\bar{\theta}$. First,
it can be shown that if $\bar{R}$ is simplicial (cf. Condition 1)
then, with high probability, no row of $\bar{\theta}$ will be in the
convex hull of the others (see Appendix D). Next, the separability
property ensures that if $w$ is a novel word of topic $k$, then
$\bar{\beta}_{w,k} = 1$ and $\bar{\beta}_{w,j} = 0$
$\forall\, j \neq k$ so that $\bar{A}_w = \bar{\theta}_k$. Revisiting
the example in Fig. 1, the rows of $\bar{A}$ which correspond to novel
words, e.g., words 1 through 5, are all row-vectors of $\bar{\theta}$
and together form a convex hull of $K$ extreme points. For example,
$\bar{A}_1 = \bar{A}_2 = \bar{\theta}_1$ and
$\bar{A}_3 = \bar{A}_4 = \bar{\theta}_2$. If, however, $w$ is a
non-novel word, then
$\bar{A}_w = \sum_k \bar{\beta}_{w,k} \bar{\theta}_k$ lives inside the
convex hull of the rows of $\bar{\theta}$. In Fig. 1, row $\bar{A}_6$,
which corresponds to non-novel word 6, is inside the convex hull of
$\bar{\theta}_1, \bar{\theta}_2, \bar{\theta}_3$. In summary, the novel
words can be detected as extreme points of all the row-vectors
of $\bar{A}$. Also, multiple novel words of the same topic correspond
to the same extreme point (e.g.,
$\bar{A}_1 = \bar{A}_2 = \bar{\theta}_1$). Formally,

Lemma 3. Let $\bar{R}$ be $\gamma_s$-simplicial and $\beta$ be
separable. Then, with probability at least
$1 - 2K\exp(-c_1 M) - \exp(-c_2 M)$, the $i$-th row of $\bar{A}$ is an
extreme point of the convex hull spanned by all the rows of $\bar{A}$
if and only if word $i$ is novel. Here the constants are
$c_1 := \gamma_s^2 a_{\min}^4 / 4\lambda_{\max}$ and
$c_2 := \gamma_s^4 a_{\min}^4 / 2\lambda_{\max}^2$. The model
parameters are defined as follows: $a_{\min}$ is the minimum
element of $a$, and $\lambda_{\max}$ is the maximum singular value of
$\bar{R}$.

To see how identifying novel words can help us estimate
$\beta$, recall that the row-vectors of $\bar{A}$ corresponding to
novel words coincide with the rows of $\bar{\theta}$. Thus
$\bar{\theta}$ is known once one novel word for each topic is known.
Also, for all words $w$,
$\bar{A}_w = \sum_k \bar{\beta}_{w,k} \bar{\theta}_k$. Thus, if we can
uniquely decompose $\bar{A}_w$ as a convex combination of the extreme
points, then the coefficients of the decomposition will give us the
$w$-th row of $\bar{\beta}$. A unique decomposition exists with high
probability when $\bar{R}$ is affine-independent and can be found by
solving a constrained linear regression problem. This gives us
$\bar{\beta}$. Finally, noting that
$\mathrm{diag}(A\mathbf{1}) \bar{\beta} = \beta \, \mathrm{diag}(\theta\mathbf{1})$,
$\beta$ can be recovered by suitably renormalizing rows and then
columns of $\bar{\beta}$. To sum up,

Lemma 4. Let $A$ and one novel word per distinct topic be
given. If $R$ is $\gamma_a$-affine-independent, then, with probability at
least $1 - 2K\exp(-c_1 M) - \exp(-c_2 M)$, $\beta$ can be recovered
uniquely via constrained linear regression. Here the constants are
$c_1 := \gamma_a^2 a_{\min}^2 / 4\lambda_{\max}$ and $c_2 := \gamma_a^4 a_{\min}^4 / 2\lambda_{\max}^2$. The model
parameters are defined as follows: $a_{\min}$ is the minimum
element of $a$ and $\lambda_{\max}$ is the maximum singular value of $R$.

Lemmas 3 and 4 together provide a geometric approach
for learning $\beta$ from $A$ (equivalently $\bar{A}$): (1) Find the extreme
points of the rows of $\bar{A}$. Cluster the rows of $\bar{A}$ that correspond
to the same extreme point into the same group. (2) Express
the remaining rows of $\bar{A}$ as convex combinations of the $K$
distinct extreme points. (3) Renormalize $\bar{\beta}$ to obtain $\beta$.

Detecting Extreme Points using Random Projections: A key
contribution of our approach is an efficient random projections
based algorithm to detect novel words as extreme points.
The idea is illustrated in Fig. 1: if we project every point
of a convex body onto an isotropically distributed random
direction $d$, the maximum (or minimum) projection value
must correspond to one of the extreme points with probability
1. On the other hand, the non-novel words will not have
the maximum projection value along any random direction.
Therefore, by repeatedly projecting all the points onto a few
isotropically distributed random directions, we can detect all
the extreme points with very high probability as the number of
random directions increases. An explicit bound on the number
of projections needed appears in Theorem 3.
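
The projection idea can be illustrated with a minimal sketch. It is not the paper's algorithm (which operates on the word co-occurrence representation introduced later); it only shows, on toy two-dimensional data, that the maximizer of a random projection is always an extreme point. The function name `detect_extreme_points` and the toy points are assumptions made for illustration.

```python
import random

def detect_extreme_points(points, num_directions, seed=0):
    """Return the indices of points that attain the maximum projection
    along at least one random isotropic (spherical Gaussian) direction."""
    rng = random.Random(seed)
    dim = len(points[0])
    winners = set()
    for _ in range(num_directions):
        # A vector of iid standard Gaussians is isotropically distributed.
        d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        proj = [sum(p_i * d_i for p_i, d_i in zip(p, d)) for p in points]
        winners.add(max(range(len(points)), key=proj.__getitem__))
    return winners

# Toy data (hypothetical): the three triangle vertices play the role of
# novel words; the other points are strict convex combinations of them.
points = [
    (0.0, 0.0), (1.0, 0.0), (0.0, 1.0),   # extreme points
    (0.3, 0.3), (0.5, 0.25), (0.2, 0.6),  # interior points
]
found = detect_extreme_points(points, num_directions=200)
```

With enough directions, every vertex is eventually selected, while an interior point is never the maximizer.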

Finite N in Practice: The geometric intuition discussed above
was based on the row-vectors of $\bar{A}$. When $N = \infty$, $\bar{A} = \bar{X}$,
the matrix of row-normalized empirical word-frequencies of
all documents. If $N$ is finite but very large, $\bar{A}$ can be well-
approximated by $\bar{X}$ thanks to the law of large numbers.
However, in real-world text corpora, $N \ll W$ (e.g., $N = 298$
while $W = 14{,}943$ in the NYT dataset). Therefore, the row-
vectors of $\bar{X}$ are significantly perturbed away from the ideal
rows of $\bar{A}$, as illustrated in Fig. 1. We discuss the effect of
small $N$ and how we address the accompanying issues next.

IV. Topic Geometry with Finite Samples: Word 
Co-occurrence Matrix Representation, Solid 
Angle, and Random Projections based approach 

The extreme point geometry sketched in Sec. III-C is
perturbed when $N$ is small, as highlighted in Fig. 1. Specifically,
the rows of the empirical word-frequency matrix $\bar{X}$ deviate
from the rows of $\bar{A}$. This creates several problems: (1) points
in the convex hull corresponding to non-novel words may
also become "outlier" extreme points (e.g., $\bar{X}_6$ in Fig. 1); (2)
some extreme points that correspond to novel words may no
longer be extreme (e.g., $\bar{X}_3$ in Fig. 1); (3) multiple novel
words corresponding to the same extreme point may become
multiple distinct extreme points (e.g., $\bar{X}_1$ and $\bar{X}_2$ in Fig. 1).
Unfortunately, these issues do not vanish as $M$ increases with
$N$ fixed (a regime which captures the characteristics of typical
benchmark datasets) because the dimensionality of the rows
(equal to $M$) also increases. There is no "averaging" effect to
smooth out the sampling noise.

Our solution is to seek a new representation, a statistic of
$X$, which can not only smooth out the sampling noise of
individual documents, but also preserve the same extreme point
geometry induced by the separability and affine independence
conditions. In addition, we also develop an extreme point
robustness measure that naturally arises within our random
projections based framework. This robustness measure can be
used to detect and exclude the "outlier" extreme points.

A. Normalized Word Co-occurrence Matrix Representation 

We construct a suitably normalized word co-occurrence
matrix from $X$ as our new representation. The co-occurrence
matrix converges almost surely to an ideal statistic as $M \to \infty$
for any fixed $N \geq 2$. Simultaneously, in the asymptotic limit,
the original novel words continue to correspond to extreme
points in the new representation and the overall extreme point
geometry is preserved.

The new representation is (conceptually) constructed as
follows. First randomly divide all the words in each document
into two equal-sized independent halves and obtain two $W \times M$
empirical word-frequency matrices $X$ and $X'$, each containing
$N/2$ words per document. Then normalize their rows as in Sec. III-C to
obtain $\bar{X}$ and $\bar{X}'$, which are row-stochastic. The empirical word
co-occurrence matrix of size $W \times W$ is then given by

$$\hat{E} := M \bar{X}' \bar{X}^{\top} \tag{1}$$

We note that in our random projection based approach,
$\hat{E}$ is not explicitly constructed by multiplying $\bar{X}'$ and $\bar{X}$.
Instead, we keep $\bar{X}'$ and $\bar{X}$ and exploit their sparsity properties
to reduce the computational complexity of all subsequent
processing.

Asymptotic Consistency: The first nice property of the word
co-occurrence representation is its asymptotic consistency
when $N$ is fixed. As the number of documents $M \to \infty$,
the empirical $\hat{E}$ converges, almost surely, to an ideal word
co-occurrence matrix $E$ of size $W \times W$. Formally,

Lemma 5. ([32, Lemma 2]) Let $\hat{E}$ be the empirical word
co-occurrence matrix defined in Eq. (1). Then,

$$\hat{E} \xrightarrow{M \to \infty} \bar{\beta} \bar{R} \bar{\beta}^{\top} =: E \tag{2}$$

almost surely, where $\bar{\beta} := \mathrm{diag}^{-1}(\beta a)\, \beta\, \mathrm{diag}(a)$ and
$\bar{R} := \mathrm{diag}^{-1}(a)\, R\, \mathrm{diag}^{-1}(a)$. Furthermore, if
$\eta := \min_{1 \leq i \leq W} (\beta a)_i > 0$, then
$\Pr(\|\hat{E} - E\|_{\infty} \geq \epsilon) \leq 8 W^2 \exp(-\epsilon^2 \eta^4 M N / 20)$.

Here $\bar{R}$ is the same normalized second-moment of the topic
priors as defined in Sec. III and $\bar{\beta}$ is a row-normalized version
of $\beta$. We note the abuse of notation for $\bar{\beta}$, which was
defined in Sec. III-C. It can be shown that the $\bar{\beta}$ defined in
Lemma 5 is the limit of the one defined in Sec. III-C as $M \to \infty$.
The convergence result in Lemma 5 shows that the word
co-occurrence representation $E$ can be consistently estimated
by $\hat{E}$ as $M \to \infty$ and that the deviation vanishes exponentially in
$M$, which is large in typical benchmark datasets.

Novel Words are Extreme Points: Another reason for using
this word co-occurrence representation is that it preserves
the extreme point geometry. Consider the ideal word
co-occurrence matrix $E = \bar{\beta}(\bar{R}\bar{\beta}^{\top})$. It is straightforward to show
that if $\bar{\beta}$ is separable and $\bar{R}$ is simplicial then $(\bar{R}\bar{\beta}^{\top})$ is
also simplicial. Using these facts it is possible to establish
the following counterpart of Lemma 3 for $E$:

Lemma 6. (Novel Words are Extreme Points (0 Lemma 1]))
Let $R$ be simplicial and $\beta$ be separable. Then, a word $i$ is
novel if, and only if, the $i$-th row of $E$ is an extreme point of
the convex hull spanned by all the rows of $E$.

In other words, the novel words correspond to the extreme
points of all the row-vectors of the ideal word co-occurrence
matrix $E$. Consider the example in Fig. 4, which is based on the
same topic matrix $\beta$ as in Fig. 1. Here, $E_1 = E_2$, $E_3 = E_4$,
and $E_5$ are $K = 3$ distinct extreme points of all row-vectors
of $E$, and $E_6$, which corresponds to a non-novel word, is inside
the convex hull.

Once the novel words are detected as extreme points, we
can follow the same procedure as in Lemma 4 and express
each row $E_w$ of $E$ as a unique convex combination of the $K$
extreme rows of $E$ or, equivalently, the rows of $(\bar{R}\bar{\beta}^{\top})$. The
weights of the convex combination are the $\bar{\beta}_{wk}$'s. We can then
apply the same row and column renormalization to obtain $\beta$.
The following result is the counterpart of Lemma 4 for $E$:

Lemma 7. Let $E$ and one novel word for each distinct topic
be given. If $R$ is affine-independent, then $\beta$ can be recovered
uniquely via constrained linear regression.

One can follow the same steps as in the proof of Lemma 4.
The only additional step is to check that $\bar{R}\bar{\beta}^{\top} = [\bar{R}, \bar{R}B]$ is
affine-independent if $\bar{R}$ is affine-independent.

We note that the finite sampling noise perturbation $\hat{E} - E$
is still not 0 but vanishes as $M \to \infty$ (in contrast to the
$\bar{X}$ representation in Sec. III-C). However, there is still a
possibility of observing "outlier" extreme points if a non-novel
word lies on a facet of the convex hull of the rows of $E$.
We next introduce an extreme point robustness measure based
on a certain solid angle that naturally arises in our random
projections based approach, and discuss how it can be used to
detect and distinguish between "true" novel words and such
"outlier" extreme points.


B. Solid Angle Extreme Point Robustness Measure 

To handle the impact of a small but nonzero perturbation
$\|\hat{E} - E\|_{\infty}$, we develop an extreme point "robustness" measure.
This is necessary not only for applying our approach to real-
world data but also for establishing finite sample complexity
bounds. Intuitively, a robustness measure should be able to
distinguish between the "true" extreme points (row vectors
that are novel words) and the "outlier" extreme points (row
vectors of non-novel words that become extreme points due to
the nonzero perturbation). Towards this goal, we leverage a key
geometric quantity, namely, the Normalized Solid Angle
subtended by the convex hull of the rows of $E$ at an extreme point.
To visualize this quantity, we revisit our running example in
Fig. 4 and indicate the solid angles attached to each extreme
point by the shaded regions. It turns out that this geometric
quantity naturally arises in the context of random projections
that was discussed earlier. To see this connection, in Fig. 4
observe that the shaded region attached to any extreme point
coincides precisely with the set of directions along which its
projection is larger (taking sign into account) than that of
any other point (whether extreme or not). For example, in
Fig. 4 the projection of $E_1 = E_2$ along $d_1$ is larger than that
of any other point. Thus, the solid angle attached to a point
$E_i$ (whether extreme or not) can be formally defined as the
set of directions $\{d : \forall j : E_j \neq E_i, \langle E_i, d \rangle > \langle E_j, d \rangle\}$.
This set is nonempty only for extreme points. The solid angle
defined above is a set. To derive a scalar robustness measure
from this set and tie it to the idea of random projections, we
adopt a statistical perspective and define the normalized solid
angle of a point as the probability that the point will have the
maximum projection value along an isotropically distributed
random direction. Concretely, for the $i$-th word (row vector),
the normalized solid angle $q_i$ is defined as

$$q_i := \Pr(\forall j : E_j \neq E_i, \langle E_i, d \rangle > \langle E_j, d \rangle) \tag{3}$$

where $d$ is drawn from an isotropic distribution in $\mathbb{R}^W$ such
as the spherical Gaussian. The condition $E_i \neq E_j$ in Eq. (3)
is introduced to exclude the multiple novel words of the same
topic that correspond to the same extreme point. For instance,
in Fig. 4, $E_1 = E_2$. Hence, for $i = 1$, $j = 2$ is excluded. To make
it practical to handle finite sample estimation noise, we replace
the condition $E_j \neq E_i$ by the condition $\|E_i - E_j\| \geq \zeta$ for
some suitably defined $\zeta$.

Fig. 4. An example of separable topic matrix $\beta$ (left) and the underlying
geometric structure (right) in the word co-occurrence representation. Note:
the word ordering is only for visualization and has no bearing on separability.
The example topic matrix $\beta$ is the same as in Fig. 1. Solid circles represent
the rows of $E$. The shaded regions depict the solid angles subtended by each
extreme point. $d_1, d_2, d_3$ are isotropic random directions along which each
extreme point has maximum projection value. They can be used to estimate
the solid angles.

As illustrated in Fig. 4, the solid angles of all the extreme
points are strictly positive given that $R$ is $\gamma_s$-simplicial. On the
other hand, for $i$ that is non-novel, the corresponding solid
angle $q_i$ is zero by definition. Hence the extreme point
geometry in Lemma 6 can be re-expressed in terms of solid
angles as follows:

Lemma 8. (Novel Words have Positive Solid Angles) Let $R$ be
simplicial and $\beta$ be separable. Then, word $i$ is a novel word
if, and only if, $q_i > 0$.

We denote the smallest solid angle among the $K$ distinct
extreme points by $q_{\wedge} > 0$. This is a robust condition number
of the convex hull formed by the rows of $E$ and is related to
the simplicial constant $\gamma_s$ of $R$.

In a real-world dataset we have access to only an empirical
estimate $\hat{E}$ of the ideal word co-occurrence matrix $E$. If we
replace $E$ with $\hat{E}$, then the resulting empirical solid angle
estimate $\hat{q}_i$ will be very close to the ideal $q_i$ if $\hat{E}$ is close
enough to $E$. Then, the solid angles of "outlier" extreme points
will be close to 0 while they will be bounded away from zero
for the "true" extreme points. One can then hope to correctly
identify all $K$ extreme points by rank-ordering all empirical
solid angle estimates and selecting the $K$ distinct row-vectors
that have the largest solid angles. This forms the basis of our
proposed algorithm. The problem now boils down to efficiently
estimating the solid angles and establishing the asymptotic
convergence of the estimates as $M \to \infty$. We next discuss
how random projections can be used to achieve these goals.


C. Efficient Solid Angle Estimation via Random Projections 

The definition of the normalized solid angle in Eq. (3)
motivates an efficient algorithm based on random projections
to estimate it. For convenience, we first rewrite Eq. (3) as

$$q_i = \mathbb{E}\left[ \mathbb{I}\{\forall j : \|E_j - E_i\| \geq \zeta,\ E_i d > E_j d\} \right] \tag{4}$$

and then propose to estimate it by

$$\hat{q}_i = \frac{1}{P} \sum_{r=1}^{P} \mathbb{I}\left(\forall j : \hat{E}_{i,i} + \hat{E}_{j,j} - 2\hat{E}_{i,j} \geq \zeta/2,\ \hat{E}_i d^r > \hat{E}_j d^r\right) \tag{5}$$

where $d^1, \ldots, d^P \in \mathbb{R}^{W \times 1}$ are $P$ iid directions drawn from
an isotropic distribution in $\mathbb{R}^W$. Algorithmically, by Eq. (5),
we approximate the solid angle $q_i$ at the $i$-th word (row-vector)
by first projecting all the row-vectors onto $P$ iid isotropic
random directions and then calculating the fraction of times
each row-vector achieves the maximum projection value. It
turns out that the condition $\hat{E}_{i,i} + \hat{E}_{j,j} - 2\hat{E}_{i,j} \geq \zeta/2$ is
equivalent to $\|E_i - E_j\| \geq \zeta$ in terms of its ability to exclude
multiple novel words from the same topic and is adopted for
its simplicity (see footnote 2).
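
As a rough illustration of Eq. (5), the following sketch estimates the normalized solid angles for a tiny ideal co-occurrence matrix built as $E = \bar{\beta}\bar{R}\bar{\beta}^{\top}$. The values of `beta_bar` and `R_bar`, the function name `estimate_solid_angles`, and the choice $\zeta = 0.1$ are illustrative assumptions, not taken from the paper. Note that for this construction $E_{i,i} + E_{j,j} - 2E_{i,j} = (\bar{\beta}_i - \bar{\beta}_j)\bar{R}(\bar{\beta}_i - \bar{\beta}_j)^{\top}$, which is zero when words $i$ and $j$ have identical rows in $\bar{\beta}$.

```python
import random

def estimate_solid_angles(E, zeta, num_projections, seed=0):
    """Monte Carlo estimate in the spirit of Eq. (5): the fraction of
    random directions along which row i beats every row j that is
    'distinct' from it, i.e. E[i][i] + E[j][j] - 2 E[i][j] >= zeta / 2."""
    rng = random.Random(seed)
    W = len(E)
    far = [[j for j in range(W)
            if E[i][i] + E[j][j] - 2.0 * E[i][j] >= zeta / 2.0]
           for i in range(W)]
    counts = [0] * W
    for _ in range(num_projections):
        d = [rng.gauss(0.0, 1.0) for _ in range(W)]      # isotropic direction
        v = [sum(e * x for e, x in zip(row, d)) for row in E]
        for i in range(W):
            if all(v[i] > v[j] for j in far[i]):
                counts[i] += 1
    return [c / num_projections for c in counts]

# Hypothetical toy model: words 0 and 1 are novel for topic 0, word 2 is
# novel for topic 1, and word 3 is an even mixture of both topics.
beta_bar = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
R_bar = [[1.5, 0.5], [0.5, 1.5]]
E = [[sum(bi[k] * R_bar[k][l] * bj[l] for k in range(2) for l in range(2))
      for bj in beta_bar] for bi in beta_bar]
q_hat = estimate_solid_angles(E, zeta=0.1, num_projections=2000)
```

In this toy example the two novel words of topic 0 receive the same estimate, each topic's extreme point wins on roughly half of the directions, and the mixed word never wins.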

This procedure of taking random projections followed by
calculating the number of times a word is a maximizer via
Eq. (5) provides a consistent estimate of the solid angle in
Eq. (3) as $M \to \infty$ and the number of projections $P$ increases.
The high-level idea is simple: as $P$ increases, the empirical
average in Eq. (5) converges to the corresponding expectation.
Simultaneously, as $M$ increases, $\hat{E} \to E$. Overall, the
approximation $\hat{q}_i$ proposed in Eq. (5) using random projections
converges to $q_i$.

This random projections based approach is also
computationally efficient for the following reasons. First, it enables us
to avoid the explicit construction of the $W \times W$ dimensional
matrix $\hat{E}$: recall that each column of $\bar{X}$ and $\bar{X}'$ has no
more than $N \ll W$ nonzero entries. Hence $\bar{X}$ and $\bar{X}'$ are
both sparse. Since $\hat{E} d = M \bar{X}'(\bar{X}^{\top} d)$, the projection can
be calculated using two sparse matrix-vector multiplications.
Second, it turns out that the number of projections $P$ needed
to guarantee consistency is small. In fact, in Theorem 3 we
provide a sufficient upper bound for $P$ which is a polynomial
function of $\log(W)$, $\log(1/\delta)$ and other model parameters,
where $\delta$ is the probability that the algorithm fails to detect all
the distinct novel words.
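
The first point can be sketched as follows. The code stores each document (a column of $\bar{X}$ or $\bar{X}'$) as a dictionary of its nonzero entries and evaluates $M \bar{X}'(\bar{X}^{\top} d)$ directly, then checks the result against an explicitly formed $W \times W$ matrix. The data layout, function names, and toy values are assumptions made for illustration; the columns are assumed to be already row-normalized.

```python
def project_without_forming_E(Xp_cols, X_cols, d, W):
    """Compute M * X' (X^T d) from sparse columns, touching only nonzeros.
    Each column is a dict {word index: value} for one document."""
    M = len(X_cols)
    v = [0.0] * W
    for xp_m, x_m in zip(Xp_cols, X_cols):
        s = sum(val * d[w] for w, val in x_m.items())  # scalar (X_m)^T d
        for w, val in xp_m.items():
            v[w] += M * val * s
    return v

def project_with_dense_E(Xp_cols, X_cols, d, W):
    """Reference computation that builds the W x W matrix explicitly."""
    M = len(X_cols)
    E = [[M * sum(xp.get(i, 0.0) * x.get(j, 0.0)
                  for xp, x in zip(Xp_cols, X_cols))
          for j in range(W)] for i in range(W)]
    return [sum(E[i][j] * d[j] for j in range(W)) for i in range(W)]

# Hypothetical toy corpus: W = 4 words, M = 3 documents.
Xp_cols = [{0: 0.5, 2: 1.0}, {0: 0.5, 1: 1.0}, {3: 1.0}]
X_cols = [{0: 1.0, 3: 0.4}, {1: 0.5, 2: 1.0}, {1: 0.5, 3: 0.6}]
d = [0.3, -1.2, 0.7, 0.1]
v_sparse = project_without_forming_E(Xp_cols, X_cols, d, W=4)
v_dense = project_with_dense_E(Xp_cols, X_cols, d, W=4)
```

The sparse version costs time proportional to the number of nonzero entries per direction, whereas the dense reference costs on the order of $W^2$ per direction after forming the matrix.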

Parallelization, Distributed and Online Settings: Another
advantage of the proposed random projections based approach
is that it can be parallelized and is naturally amenable to
online or distributed settings. This is based on the
observation that each projection has an additive structure:

$$\hat{E} d^r = M \bar{X}' \bar{X}^{\top} d^r = M \sum_{m=1}^{M} \bar{X}'_m \bar{X}_m^{\top} d^r.$$

The $P$ projections can also be computed independently.
Therefore,

• In a distributed setting in which the documents are stored
on distributed servers, we can first share the same random
directions across servers and then aggregate the projection
values. The communication cost is only the "partial"
projection values and is therefore insignificant 0 and
does not scale as the number of observations $N, M$
increases.

• In an online setting in which the documents are streamed
in an online fashion [20], we only need to keep all the
projection values and update the projection values (hence
the empirical solid angle estimates) when new documents
arrive.

2 We abuse the symbol $\zeta$ by using it to indicate different thresholds in these
conditions.

The additive and independent structure guarantees that the
statistical efficiency of these variations is the same as that of the
centralized "batch" implementation. For the rest of this paper,
we focus only on the centralized version.
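
The additive structure can be made concrete with a small sketch. Each hypothetical "server" computes the sum of $\bar{X}'_m(\bar{X}_m^{\top} d)$ over its own documents, and only these length-$W$ partial vectors are aggregated. The function names and toy data are assumptions for illustration; row normalization is assumed to have been applied already.

```python
def partial_projection(docs, d, W):
    """Sum of X'_m (X_m^T d) over the documents held by one server.
    docs is a list of (x_prime_col, x_col) pairs of sparse dicts."""
    v = [0.0] * W
    for xp_m, x_m in docs:
        s = sum(val * d[w] for w, val in x_m.items())
        for w, val in xp_m.items():
            v[w] += val * s
    return v

def aggregate(partials, M):
    """Combine the partial projection vectors and apply the factor M."""
    W = len(partials[0])
    return [M * sum(p[w] for p in partials) for w in range(W)]

# Hypothetical corpus of M = 3 documents over W = 4 words.
docs = [
    ({0: 0.5, 2: 1.0}, {0: 1.0, 3: 0.4}),
    ({0: 0.5, 1: 1.0}, {1: 0.5, 2: 1.0}),
    ({3: 1.0}, {1: 0.5, 3: 0.6}),
]
d = [0.3, -1.2, 0.7, 0.1]  # the same direction is shared by all servers

batch = aggregate([partial_projection(docs, d, 4)], M=3)
sharded = aggregate([partial_projection(docs[:2], d, 4),
                     partial_projection(docs[2:], d, 4)], M=3)
```

The same pattern covers the online setting: the running partial vector is kept and the contribution of each newly arrived document is added to it.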

Outline of Overall Approach: Our overall approach can
be summarized as follows. (1) Estimate the empirical solid
angles using $P$ iid isotropic random directions as in Eq. (5).
(2) Select the $K$ words with distinct word co-occurrence
patterns (rows) that have the largest empirical solid angles. (3)
Estimate the topic matrix using constrained linear regression
as in Lemma 4. We will discuss the details of our overall
approach in the next section and establish guarantees for its
computational and statistical efficiency.
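
Step (2) of this outline can be sketched as a greedy pass over the words in decreasing order of estimated solid angle, keeping a word only if its co-occurrence pattern is distinct from every word already kept. The function name `select_novel_words`, the toy matrix, and the solid angle values are illustrative assumptions; unlike a literal while-loop, this sketch simply stops when the vocabulary is exhausted.

```python
def select_novel_words(q_hat, E, K, zeta):
    """Pick K words with the largest solid angle estimates whose rows are
    pairwise distinct: E[p][p] + E[i][i] - 2 E[i][p] >= zeta / 2."""
    order = sorted(range(len(q_hat)), key=lambda i: q_hat[i], reverse=True)
    selected = []
    for i in order:
        if len(selected) == K:
            break
        if all(E[p][p] + E[i][i] - 2.0 * E[i][p] >= zeta / 2.0
               for p in selected):
            selected.append(i)
    return selected

# Hypothetical inputs: words 0 and 1 share a co-occurrence pattern,
# word 2 is novel for another topic, and word 3 is non-novel.
E = [
    [1.5, 1.5, 0.5, 1.0],
    [1.5, 1.5, 0.5, 1.0],
    [0.5, 0.5, 1.5, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
q_hat = [0.49, 0.49, 0.51, 0.0]
novel = select_novel_words(q_hat, E, K=2, zeta=0.1)
```

Word 1 is skipped because it is indistinguishable from word 0, so one representative per topic is returned.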

V. Algorithm and Analysis 

Algorithm 1 describes the main steps of our overall random
projections based algorithm, which we call RP. The two main
steps, novel word detection and topic matrix estimation, are
outlined in Algorithms 2 and 3 respectively. Algorithm 2
outlines the random projection and rank-ordering steps.
Algorithm 3 describes the constrained linear regression and the
renormalization steps in a combined way.

Algorithm 1 RP

Input: Text documents $\bar{X}, \bar{X}'$ ($W \times M$); Number of topics
$K$; Number of iid random projections $P$; Tolerance
parameters $\zeta, \epsilon > 0$.

Output: Estimate of the topic matrix $\hat{\beta}$ ($W \times K$).

1: Set of Novel Words $\mathcal{I}$ ← NovelWordDetect($\bar{X}, \bar{X}', K, P, \zeta$)

2: $\hat{\beta}$ ← EstimateTopics($\mathcal{I}, \bar{X}, \bar{X}', \epsilon$)


Computational Efficiency: We first summarize the
computational efficiency of Algorithm 1.

Theorem 2. Let the number of novel words for each topic
be a constant relative to $M, W, N$. Then, the running time of
Algorithm 1 is $O(MNP + WP + WK^3)$.

This efficiency is achieved by exploiting the sparsity of
$X$ and the property that there are only a small number of
novel words in a typical vocabulary. A detailed analysis of the
computational complexity is presented in the appendix. Here
we point out that in order to upper bound the computation time
of the linear regression in Algorithm 3 we used $O(WK^3)$
for $W$ matrix inversions, one for each of the words in the
vocabulary. In practice, a gradient descent implementation
can be used for the constrained linear regression, which is
much more efficient. We also note that these $W$ optimization
problems are decoupled given the set of detected novel words.
Therefore, they can be parallelized in a straightforward manner
0 .

Algorithm 2 NovelWordDetect (via Random Projections)
Input: $\bar{X}, \bar{X}'$; Number of topics $K$; Number of projections
$P$; Tolerance $\zeta$;
Output: The set of all novel words of $K$ distinct topics $\mathcal{I}$.
1: $\hat{q}_i \leftarrow 0$, $\forall i = 1, \ldots, W$, $\hat{E} \leftarrow M \bar{X}' \bar{X}^{\top}$.
2: for all $r = 1, \ldots, P$ do
3:   Sample $d^r \in \mathbb{R}^W$ from an isotropic prior.
4:   $v \leftarrow M \bar{X}' \bar{X}^{\top} d^r$
5:   $i^* \leftarrow \arg\max_{1 \leq i \leq W} v_i$, $\hat{q}_{i^*} \leftarrow \hat{q}_{i^*} + 1/P$
6:   $\hat{J}_{i^*} \leftarrow \{j : \hat{E}_{i^*,i^*} + \hat{E}_{j,j} - 2\hat{E}_{i^*,j} \geq \zeta/2\}$
7:   for all $k \in \hat{J}_{i^*}^{c}$ do
8:     $\hat{J}_k \leftarrow \{j : \hat{E}_{k,k} + \hat{E}_{j,j} - 2\hat{E}_{k,j} \geq \zeta/2\}$
9:     if $\{\forall j \in \hat{J}_k, v_k > v_j\}$ then
10:      $\hat{q}_k \leftarrow \hat{q}_k + 1/P$
11:    end if
12:  end for
13: end for
14: $\mathcal{I} \leftarrow \emptyset$, $k \leftarrow 0$, $j \leftarrow 1$
15: while $k < K$ do
16:   $i \leftarrow$ index of the $j$-th largest value of $\{\hat{q}_1, \ldots, \hat{q}_W\}$.
17:   if $\{\forall p \in \mathcal{I}, \hat{E}_{p,p} + \hat{E}_{i,i} - 2\hat{E}_{i,p} \geq \zeta/2\}$ then
18:     $\mathcal{I} \leftarrow \mathcal{I} \cup \{i\}$, $k \leftarrow k + 1$
19:   end if
20:   $j \leftarrow j + 1$
21: end while
22: Return $\mathcal{I}$.

Algorithm 3 EstimateTopics
Input: $\mathcal{I} = \{i_1, \ldots, i_K\}$ set of novel words, one for each of
the $K$ topics; $\hat{E}$; precision parameter $\epsilon$
Output: $\hat{\beta}$, which is the estimate of the $\beta$ matrix
1: $\hat{E}^*_w = [\hat{E}_{w,i_1}, \ldots, \hat{E}_{w,i_K}]$
2: $Y = (\hat{E}^{*\top}_{i_1}, \ldots, \hat{E}^{*\top}_{i_K})^{\top}$
3: for all $i = 1, \ldots, W$ do
4:   Solve $b^* := \arg\min_b \|\hat{E}^*_i - bY\|^2$
5:   subject to $b_j \geq 0$, $\sum_{j=1}^{K} b_j = 1$
6:   using precision $\epsilon$ for the stopping-criterion.
7:   $\hat{\beta}_i \leftarrow (\frac{1}{M} X_i \mathbf{1})\, b^*$
8: end for
9: $\hat{\beta} \leftarrow$ column normalize $\hat{\beta}$
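
The constrained linear regression in Algorithm 3 can be sketched with projected gradient descent onto the probability simplex. This is one possible solver chosen for illustration, not necessarily the one used by the authors; the function names, step-size rule, toy matrix `Y`, and word frequencies `word_freq` are all assumptions.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {b : b_j >= 0, sum_j b_j = 1}."""
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        css += uj
        t = (css - 1.0) / j
        if uj - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]

def fit_simplex_weights(target, Y, tol=1e-10, max_iter=10000):
    """Minimize ||target - b Y||^2 over the simplex (b is 1 x K, Y is K x n)."""
    K = len(Y)
    # 1 / (2 ||Y||_F^2) is a safe step size for this quadratic objective.
    step = 1.0 / (2.0 * sum(y * y for row in Y for y in row))
    b = [1.0 / K] * K
    for _ in range(max_iter):
        fitted = [sum(b[k] * Y[k][w] for k in range(K))
                  for w in range(len(target))]
        resid = [f - t for f, t in zip(fitted, target)]
        grad = [2.0 * sum(r * y for r, y in zip(resid, Y[k])) for k in range(K)]
        new_b = project_to_simplex([bk - step * g for bk, g in zip(b, grad)])
        if max(abs(n - o) for n, o in zip(new_b, b)) < tol:
            return new_b
        b = new_b
    return b

def rows_to_topic_matrix(weights, word_freq):
    """Scale row i by the frequency of word i, then normalize each column."""
    K = len(weights[0])
    scaled = [[f * b for b in row] for row, f in zip(weights, word_freq)]
    col_sums = [sum(row[k] for row in scaled) for k in range(K)]
    return [[row[k] / col_sums[k] for k in range(K)] for row in scaled]

# Hypothetical inputs: rows restricted to the columns of novel words 0 and 2.
Y = [[1.5, 0.5], [0.5, 1.5]]
E_star = [[1.5, 0.5], [1.5, 0.5], [0.5, 1.5], [1.0, 1.0]]
word_freq = [0.2, 0.1, 0.3, 0.4]
weights = [fit_simplex_weights(row, Y) for row in E_star]
beta_hat = rows_to_topic_matrix(weights, word_freq)
```

Because the $W$ regressions are independent once the novel words are fixed, the list comprehension over `E_star` is the part that could be distributed across workers.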

Asymptotic Consistency and Statistical Efficiency: We now
summarize the asymptotic consistency and sample complexity
bounds for Algorithm 1. The analysis is a combination of the
consistency of the novel word detection step (Algorithm 2) and
the topic estimation step (Algorithm 3). We state the results
for both of these steps. First, for detecting all the novel words
of the $K$ distinct topics, we have the following result:

Theorem 3. Let topic matrix $\beta$ be separable and $R$ be $\gamma_s$-
simplicial. If the projection directions are iid sampled from
any isotropic distribution, then Algorithm 2 can identify all
the novel words of the $K$ distinct topics as $M, P \to \infty$.
Furthermore, $\forall \delta > 0$, if

$$M \geq \frac{[\ldots]}{N \rho^2 \eta^4} \quad \text{and} \quad P \geq \frac{[\ldots]}{q_{\wedge}^2} \tag{6}$$

then Algorithm 2 fails with probability at most $\delta$. The model
parameters are defined as follows. $\rho = \min\{\ldots, \ldots\}$ where
$d = (1 - b)^2 \gamma_s^2 / \lambda_{\max}$, $d_2 = (1 - b)\gamma_s$, $\lambda_{\max}$ is the maximum
eigenvalue of $R$, $b = \max_{j \in C_0, k} \bar{\beta}_{j,k}$, and $C_0$ is the set of
non-novel words. Finally, $q_{\wedge}$ is the minimum solid angle of
the extreme points of the convex hull of the rows of $E$.


The detailed proof is presented in the appendix. The results
in Eq. (6) provide a sufficient finite sample complexity bound
for novel word detection. The bound is polynomial with
respect to $M, W, K, N, \log(\delta)$ and other model parameters.
The number of projections $P$ that impacts the computational
complexity scales as $\log(W)/q_{\wedge}^2$ in this sufficient bound, where
$q_{\wedge}$ can be upper bounded by $1/K$. In practice, we have found
that setting $P = O(K)$ is a good choice [5].

We note that the result in Theorem 3 only requires the
simplicial condition, which is the minimum condition required
for consistent novel word detection (Lemma 1). This theorem
holds true if the topic prior $R$ satisfies stronger conditions
such as affine-independence. We also point out that our proof
in this paper holds for any isotropic distribution on the random
projection directions $d^1, \ldots, d^P$. The previous result in GO,
however, only applies to some specific isotropic distributions
such as the Spherical Gaussian or the uniform distribution
in a unit ball. In practice, we use the Spherical Gaussian since
sampling from such a prior is simple and requires only $O(W)$
time for generating each random direction.

Next, given the successful detection of the set of novel
words for all topics, we have the following result for the
accurate estimation of the separable topic matrix $\beta$:

Theorem 4. Let topic matrix $\beta$ be separable and $R$ be $\gamma_a$-
affine-independent. Given the successful detection of novel
words for all $K$ distinct topics, the output of Algorithm 3
$\hat{\beta} \xrightarrow{p} \beta$ element-wise (up to a column permutation).
Specifically, if

$$M \geq \frac{2560\, W^2 K \log(W^4 K/\delta)}{[\ldots]} \tag{7}$$

then $\forall i, k$, $\hat{\beta}_{i,k}$ will be $\epsilon$ close to $\beta_{i,k}$ with probability at least
$1 - \delta$, for any $0 < \epsilon < 1$. $\eta$ is the same as in Theorem 3. $a_{\min}$
is the minimum value in $a$.

We note that the sufficient sample complexity bound in
Eq. (7) is again polynomial in terms of all the model
parameters. Here we only require $R$ to be affine-independent.
Combining Theorem 3 and Theorem 4 gives the consistency
and sample complexity bounds of our overall approach in
Algorithm 1.


VI. Experimental Results 

In this section, we present experimental results on both
synthetic and real-world datasets. We report different performance
measures that have been commonly used in the topic modeling
literature. When the ground truth is available (Sec. VI-A),
we use the $\ell_1$ reconstruction error between the ground truth
topics and the estimates after proper topic alignment. For
the real-world text corpus in Sec. VI-B, we report the held-
out probability, which is a standard measure used in the
topic modeling literature. We also qualitatively (semantically)
compare the topics extracted by the different approaches using
the top probable words for each topic.

A. Semi-synthetic text corpus 

In order to validate our proposed algorithm, we generate
"semi-synthetic" text corpora by sampling from a synthetic,
yet realistic, ground truth topic model. To ensure that the
semi-synthetic data is similar to real-world data in terms of
dimensionality, sparsity, and other characteristics, we use the
following generative procedure adapted from [5], [7].

We first train an LDA model (with $K = 100$) on a real-
world dataset using a standard Gibbs Sampling method with
default parameters (as described in El, ED) to obtain a topic
matrix $\beta_0$ of size $W \times K$. The real-world dataset that we use
to generate our synthetic data is derived from a New York
Times (NYT) articles dataset JS]. The original vocabulary is
first pruned based on document frequencies. Specifically, as
is standard practice, only words that appear in more than
500 documents are retained. Thereafter, again as per standard
practice, the words in the so-called stop-word list are deleted
as recommended in [34]. After these steps, $M = 300{,}000$,
$W = 14{,}943$, and the average document length $N = 298$. We
then generate semi-synthetic datasets, for various values of $M$,
by fixing $N = 300$ and using $\beta_0$ and a Dirichlet topic prior.
As suggested in CD and used in a, q, we use symmetric
hyper-parameters (0.03) for the Dirichlet topic prior.

The $W \times K$ topic matrix $\beta_0$ may not be separable. To
enforce separability, we create a new separable $(W + K) \times K$
dimensional topic matrix $\beta_{\mathrm{sep}}$ by inserting $K$ synthetic novel
words (one per topic) having suitable probabilities in each
topic. Specifically, $\beta_{\mathrm{sep}}$ is constructed by transforming $\beta_0$ as
follows. First, for each synthetic novel word in $\beta_{\mathrm{sep}}$, the value
of the sole nonzero entry in its row is set to the probability
of the most probable word in the topic (column) of $\beta_0$ for
which it is a novel word. Then the resulting $(W + K) \times K$
dimensional nonnegative matrix is renormalized column-wise
to make it column-stochastic. Finally, we generate
semi-synthetic datasets, for various values of $M$, by fixing $N = 300$
and using $\beta_{\mathrm{sep}}$ and the same symmetric Dirichlet topic prior
used for $\beta_0$.

We use the name Semi-Syn to refer to datasets that are
generated using $\beta_0$ and the name Semi-Syn+Novel for datasets
generated using $\beta_{\mathrm{sep}}$.

In our proposed random projections based algorithm, which
we call RP, we set $P = 150 \times K$, $\zeta = 0.05$, and $\epsilon = 10^{-4}$.
We compare RP against the provably efficient algorithm
RecoverL2 in [7] and the standard Gibbs Sampling based
LDA algorithm (denoted by Gibbs) in [11], [33]. In order
to measure the performance of different algorithms in our
experiments based on semi-synthetic data, we compute the
$\ell_1$ norm of the reconstruction error between $\hat{\beta}$ and $\beta$. Since
all column permutations of a given topic matrix correspond
to the same topic model (for a corresponding permutation of
the topic mixing weights), we use a bipartite graph matching
algorithm to optimally match the columns of $\hat{\beta}$ with those of
$\beta$ (based on minimizing the sum of $\ell_1$ distances between all
pairs of matching columns) before computing the $\ell_1$ norm of
the reconstruction error between $\hat{\beta}$ and $\beta$.
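
The column-matching step can be sketched as follows. For readability this sketch searches over all permutations, which is only feasible for a handful of topics; for $K = 100$ an assignment solver such as the Hungarian algorithm would be needed (for example SciPy's `scipy.optimize.linear_sum_assignment`). The function name and the toy matrices are illustrative assumptions.

```python
from itertools import permutations

def l1_error_after_matching(beta_hat, beta):
    """Return (total l1 error, best permutation), where column a of
    beta_hat is matched with column best[a] of beta.
    Both matrices are lists of W rows with K entries each."""
    K = len(beta[0])

    def col_dist(a, b):
        return sum(abs(rh[a] - r[b]) for rh, r in zip(beta_hat, beta))

    cost = [[col_dist(a, b) for b in range(K)] for a in range(K)]
    best = min(permutations(range(K)),
               key=lambda p: sum(cost[a][p[a]] for a in range(K)))
    return sum(cost[a][best[a]] for a in range(K)), best

# Hypothetical ground truth (columns sum to one) and an estimate that
# equals it up to a relabeling of the topics.
beta = [[0.5, 0.1, 0.0],
        [0.5, 0.2, 0.3],
        [0.0, 0.7, 0.7]]
beta_hat = [[row[2], row[0], row[1]] for row in beta]
error, matching = l1_error_after_matching(beta_hat, beta)
```

Dividing `error` by $K$ gives the per-topic normalization used when reporting results.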

The results on both Semi-Syn+Novel NYT and Semi-Syn
NYT are summarized in Fig. 5 for all three algorithms for
various choices of the number of documents $M$. We note that
in these figures the $\ell_1$ norm of the error has been normalized
by the number of topics ($K = 100$).



Fig. 5. $\ell_1$ norm of the error in estimating the topic matrix $\beta$ for various
$M$ ($K = 100$): (Top) Semi-Syn+Novel NYT; (Bottom) Semi-Syn NYT. RP
is the proposed algorithm, RecoverL2 is a provably efficient algorithm from
[7], and Gibbs is the Gibbs Sampling approximation algorithm in [11]. In RP,
$P = 150K$, $\zeta = 0.05$, and $\epsilon = 10^{-4}$.

As Fig. 5 shows, when the separability condition is strictly
satisfied (Semi-Syn+Novel), the reconstruction error of RP
converges to 0 as $M$ becomes large, and RP outperforms the
approximation-based Gibbs. When the separability condition
is not strictly satisfied (Semi-Syn), the reconstruction error of
RP is comparable to Gibbs (a practical benchmark).

Solid Angle and Model Selection: In our proposed algorithm
RP, the number of topics $K$ (the model-order) needs to be
specified. When $K$ is unavailable, it needs to be estimated
from the data. Although this is not the focus of this work,
Algorithm 2, which identifies novel words by sorting and clustering
the estimated solid angles of words, can be suitably modified
to estimate $K$.

Indeed, in the ideal scenario where there is no sampling
noise ($M = \infty$, $\hat{E} = E$, and $\forall i$, $\hat{q}_i = q_i$), only novel words
have positive solid angles ($q_i$'s) and the rows of $E$
corresponding to the novel words of the same topic are identical, i.e., the
distance between the rows is zero or, equivalently, they are
within a neighborhood of size zero of each other. Thus, the
number of distinct neighborhoods of size zero among the
non-zero solid angle words equals $K$.

In the nonideal case $M$ is finite. If $M$ is sufficiently large,
one can expect that the estimated solid angles of non-novel
words will not all be zero. They are, however, likely to be
much smaller than those of novel words. Thus to reliably
estimate $K$ one should not only exclude words with exactly
zero solid angle estimates, but also those below some nonzero
threshold. When $M$ is finite, the rows of $\hat{E}$ corresponding
to the novel words of the same topic are unlikely to be
identical, but if $M$ is sufficiently large they are likely to be close
to each other. Thus, if the threshold $\zeta$ in Algorithm 2, which
determines the size of the neighborhood for clustering all novel
words belonging to the same topic, is made sufficiently small,
then each neighborhood will have only novel words belonging
to the same topic.

With the two modifications discussed above, the number of
distinct neighborhoods of a suitably nonzero size (determined
by $\zeta > 0$) among the words whose solid angle estimates are
larger than some threshold $\tau > 0$ will provide an estimate of
$K$. The values of $\tau$ and $\zeta$ should, in principle, decrease to zero
as $M$ increases to infinity. Leaving the task of unraveling the
dependence of $\tau$ and $\zeta$ on $M$ to future work, here we only
provide a brief empirical validation on both the Semi-Syn+Novel
and Semi-Syn NYT datasets. We set $M = 2{,}000{,}000$ so that
the reconstruction error has essentially converged (see Fig. 5),
and consider different choices of the threshold $\zeta$.

We run Algorithm 2 with $K = 100$, $P = 150 \times K$,
and a new line of code, 16': (if $\{\hat{q}_i = 0\}$, break);, inserted
between lines 16 and 17 (this corresponds to $\tau = 0$). The
input hyperparameter $K = 100$ is not the actual number of
estimated topics. It should be interpreted as specifying an
upper bound on the number of topics. The value of (little)
$k$ when Algorithm 2 terminates (see lines 14-21) provides an
estimate of the number of topics.
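
A minimal sketch of this model-order estimate is given below, assuming the solid angle estimates and the co-occurrence matrix are already available. The function name `estimate_num_topics`, the parameter `k_max` (playing the role of the upper bound $K$), and the toy inputs are hypothetical; setting `tau=0.0` mimics the inserted break on zero solid angle.

```python
def estimate_num_topics(q_hat, E, zeta, tau=0.0, k_max=None):
    """Count distinct neighborhoods among words whose estimated solid
    angle exceeds tau, scanning words in decreasing order of q_hat."""
    order = sorted(range(len(q_hat)), key=lambda i: q_hat[i], reverse=True)
    representatives = []
    for i in order:
        if q_hat[i] <= tau:
            break  # remaining words have solid angle at or below the threshold
        if k_max is not None and len(representatives) == k_max:
            break  # upper bound on the number of topics reached
        if all(E[p][p] + E[i][i] - 2.0 * E[i][p] >= zeta / 2.0
               for p in representatives):
            representatives.append(i)
    return len(representatives)

# Hypothetical inputs: two topics; words 0 and 1 are novel for the same
# topic, word 2 is novel for the other, and word 3 is non-novel.
E = [
    [1.5, 1.5, 0.5, 1.0],
    [1.5, 1.5, 0.5, 1.0],
    [0.5, 0.5, 1.5, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
q_hat = [0.49, 0.49, 0.51, 0.0]
k_est = estimate_num_topics(q_hat, E, zeta=0.1, tau=0.0, k_max=100)
```

Choosing `zeta` too large in this sketch merges distinct topics into one neighborhood, which lowers the estimate.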

Figure 6 illustrates how the solid angles of all words, sorted
in descending order, decay for different choices of $\zeta$ and how
they can be used to detect the novel words and estimate
the value of K. We note that in both semi-synthetic
datasets, for a wide range of values of $\zeta$ (0.1-5), the modified
Algorithm 2 correctly estimates the value of K as 100. When
$\zeta$ is large (e.g., $\zeta = 10$ in Fig. 6), many interior points would
be declared as novel words and multiple ideal novel words
would be grouped into one cluster. This causes K to
be underestimated (46 and 41 in Fig. 6).

B. Real-world data 

We now describe results on the actual real-world NYT
dataset that was used in Sec. VI-A to construct the
semi-synthetic datasets. Since ground truth topics are unavailable,
we measure performance using the so-called predictive held- 
out log-probability. This is a standard measure which is 
typically used to evaluate how well a learned topic model 
fits real-world data. To calculate this for each of the three 
topic estimation methods (Gibbs lUTl . H33| , RecoverL2 |f7j , 
and RP), we first randomly select 60,000 documents to test
the goodness of fit and use the remaining 240,000 documents
to produce an estimate $\hat{\beta}$ of the topic matrix. Next we assume
a Dirichlet prior on the topics and estimate its concentration
hyper-parameter $\alpha$. In Gibbs, this estimate $\hat{\alpha}$ is a byproduct
of the algorithm. In RecoverL2 and RP this can be estimated
from $\hat{\beta}$ and $X$. We then calculate the probability of observing
the test documents given the learned topic model $\hat{\beta}$ and $\hat{\alpha}$:

$$\log \Pr(X_{\text{test}} \mid \hat{\beta}, \hat{\alpha})$$

Fig. 6. Solid angles (in descending order) of all 14943 + 100 words in the Semi-Syn+Sep NYT dataset (left) and all 14943 words in the Semi-Syn NYT
dataset (right) estimated (for different values of $\zeta$) by Algorithm 2 with K = 100, P = 150 × K, M = 2,000,000, and a new line of code: 16': (if
{$q_i = 0$}, break); inserted between lines 16 and 17. The values of j and (little) k when Algorithm 2 terminates are indicated, respectively, by the position
of the vertical dashed line and the rectangular box next to it for different $\zeta$.

Since an exact evaluation of this predictive log-likelihood 
is intractable in general, we calculate it using the MCMC 
based approximation proposed in go which is now a standard 
approximation tool [33]. For RP, we use P = 150 × K,
$\zeta = 0.05$, and $\epsilon = 10^{-4}$ as in Sec. VI-A. We report the
held-out log probability, normalized by the total number of words
in the test documents, averaged across 5 training/testing splits.
The results are summarized in Table I. As shown in Table I,

TABLE I

Normalized held-out log probability of RP, RecoverL2,
and Gibbs Sampling on NYT test data. The Mean±STD's
are calculated from 5 different random training-testing splits.

K      RecoverL2      Gibbs          RP
50     -8.22±0.56     -7.42±0.45     -8.54±0.52
100    -7.63±0.52     -7.50±0.47     -7.45±0.51
150    -8.03±0.38     -7.31±0.41     -7.84±0.48
200    -7.85±0.40     -7.34±0.44     -7.69±0.42


Gibbs has the best descriptive power for new documents. RP 
and RecoverL2 have similar, but somewhat lower values than 
Gibbs. This may be attributed to missing novel words that 
appear only in the test set and are crucial to the success of 
RecoverL2 and RP. Specifically, in real-world examples, there 
is a model-mismatch as a result of which the data likelihoods 
of RP and RecoverL2 suffer. 

Finally, we qualitatively assess the topics produced by our
RP algorithm. We show some example topics extracted by RP
trained on the entire NYT dataset of M = 300,000 documents
in Table II (see footnote 3). For each topic, its most frequent words are listed.
As can be seen, the estimated topics do form recognizable
themes that can be assigned meaningful labels. The full list
of all K = 100 topics estimated on the NYT dataset can be
found in |3j . 

3 The zzz prefix in the NYT vocabulary is used to annotate certain special 
named entities. For example, zzz_nfl annotates NFL. 


TABLE II

Examples of topics estimated by RP on NYT

Topic label    Words in decreasing order of estimated probabilities
"weather"      weather wind air storm rain cold
"feeling"      feeling sense love character heart emotion
"election"     election zzz_florida ballot vote zzz_al_gore recount
"game"         yard game team season play zzz_nfl


VII. Conclusion and Discussion 

This paper proposed a provably consistent and efficient al¬ 
gorithm for topic discovery. We considered a natural structural 
property - topic separability - on the topic matrix and ex¬ 
ploited its geometric implications. We resolved the necessary 
and sufficient conditions that can guarantee consistent novel 
words detection as well as separable topic estimation. We 
then proposed a random projections based algorithm that has 
not only provably polynomial statistical and computational 
complexity but also state-of-the-art performance on semi¬ 
synthetic and real-world datasets. 

While we focused on the standard centralized batch imple¬ 
mentation in this paper, it turns out that our random projections 
based scheme is naturally amenable to an efficient distributed 
implementation which is of interest when the documents are 
stored on a network of distributed servers. This is because 
the iid isotropic projection directions can be precomputed and 
shared across document servers, and counts, projections, and 
co-occurrence matrix computations have an additive structure 
which allows partial computations to be performed at each 
document server locally and then aggregated at a fusion 
center with only a small communication cost. It turns out 
that the distributed implementation can provably match the 
polynomial computational and statistical efficiency guarantees 
of its centralized counterpart. As a consequence, it provides
a provably efficient alternative for the distributed topic
estimation problem, which has been tackled using variations of
MCMC or Variational-Bayes in the literature [20], [35]-[37].
This is appealing for modern web-scale databases, e.g., those 
generated by Twitter Streaming. A comprehensive theoretical 
and empirical investigation of the distributed variation of our 
algorithm can be found in 0. 








































Separability of general measures: We defined and studied
the notion of separability for a W × K topic matrix $\beta$, which
is a finite collection of K probability distributions over a
finite set (of size W). It turns out that we can extend the
notion of separability to a finite collection of measures over a
measurable space. This necessitates making a small technical
modification to the definition of separability to accommodate
the possibility of only having "novel subsets" that have zero
measure. We also show that our generalized definition of
separability is equivalent to the so-called irreducibility property
of a finite collection of measures that has recently been studied
in the context of mixture models to establish conditions for
the identifiability of the mixing components [38], [39].

Consider a collection of K measures $\nu_1, \ldots, \nu_K$ over a
measurable space $(\mathcal{X}, \mathcal{F})$, where $\mathcal{X}$ is a set and $\mathcal{F}$ is a $\sigma$-algebra
over $\mathcal{X}$. We define the generalized notion of separability for
measures as follows.

Definition 2. (Separability) A collection of K measures
$\nu_1, \ldots, \nu_K$ over a measurable space $(\mathcal{X}, \mathcal{F})$ is separable if
for all k = 1,..., K,

$$\inf_{A \in \mathcal{F}:\, \nu_k(A) > 0} \; \max_{j:\, j \neq k} \frac{\nu_j(A)}{\nu_k(A)} = 0. \tag{8}$$

Separability requires that for each measure $\nu_k$, there exists
a sequence of measurable sets $A_n^{(k)}$, of nonzero measure
with respect to $\nu_k$, such that, for all $j \neq k$, the
ratios $\nu_j(A_n^{(k)})/\nu_k(A_n^{(k)})$ vanish asymptotically. Intuitively,
this means that for each measure there exists a sequence of
nonzero-measure measurable subsets that are asymptotically
"novel" for that measure. When $\mathcal{X}$ is a finite set as in topic
modeling, this reduces to the existence of novel words as in
Definition 1 and $A_n^{(k)}$ are simply the sets of novel words for
topic k.

The separability property just defined is equivalent to the
so-called irreducibility property. Informally, a collection of
measures is irreducible if only nonnegative linear combinations of
them can produce a measure. Formally,

Definition 3. (Irreducibility) A collection of K measures
$\nu_1, \ldots, \nu_K$ over a measurable space $(\mathcal{X}, \mathcal{F})$ is irreducible if
the following condition holds: If $\forall A \in \mathcal{F}$, $\sum_{k=1}^{K} c_k \nu_k(A) \geq 0$,
then for all k = 1,..., K, $c_k \geq 0$.

For a collection of nonzero measures (see footnote 4), these two properties
are equivalent. Formally,

Lemma 9. A collection of nonzero measures $\nu_1, \ldots, \nu_K$ over
a measurable space $(\mathcal{X}, \mathcal{F})$ is irreducible if and only if it is
separable. In particular, a topic matrix $\beta$ is irreducible if and
only if it is separable.

The proof appears in Appendix IM1 

Topic models like LDA discussed in this paper belong to the 
larger family of Mixed Membership Latent Variable Models 
ED which have been successfully employed in a variety of 
problems that include text analysis, genetic analysis, network 
community detection, and ranking and preference discovery. 

4 A measure v is nonzero if there exists at least one measurable set A for 
which v(A) > 0. 


The structure-leveraging approach proposed in this paper can
be potentially extended to this larger family of models. Some
initial steps in this direction for rank and preference data are
explored in [32].

Finally, in this entire paper, the topic matrix is assumed to
be separable. While exact separability may be an idealization,
as shown in [2], approximate separability is both theoretically
inevitable and practically encountered when $W \gg K$.
Extending the results of this work to approximately separable
topic matrices is an interesting direction for future work. Some
steps in this direction are explored in [40] in the context of
learning mixed membership Mallows models for rankings.

Appendix

A. Proof of Lemma 1

Proof. The proof is by contradiction. We will show that if $\bar{R}$
is non-simplicial, we can construct two topic matrices $\beta^{(1)}$
and $\beta^{(2)}$ whose sets of novel words are not identical and yet
$X$ has the same distribution under both models. The difference
between the constructed $\beta^{(1)}$ and $\beta^{(2)}$ is not a result of column
permutation. This will imply the impossibility of consistent
novel word detection.

Suppose $\bar{R}$ is non-simplicial. Then we can assume, without
loss of generality, that its first row is within the convex
hull of the remaining rows, i.e., $\bar{R}_1 = \sum_{j=2}^{K} c_j \bar{R}_j$,
where $\bar{R}_j$ denotes the j-th row of $\bar{R}$, and $c_2, \ldots, c_K \geq 0$,
$\sum_{j=2}^{K} c_j = 1$ are convex combination weights. Compactly,
$e^T \bar{R} e = 0$ where $e := [-1, c_2, \ldots, c_K]^T$. Recalling that
$\bar{R} = \operatorname{diag}(a)^{-1} R \operatorname{diag}(a)^{-1}$, where $a$ is a positive vector
and $R = E(\theta^m \theta^{mT})$ by definition, we have

$$0 = e^T \bar{R} e = (\operatorname{diag}(a)^{-1} e)^T E(\theta^m \theta^{mT}) (\operatorname{diag}(a)^{-1} e) = E\left(\|\theta^{mT} \operatorname{diag}(a)^{-1} e\|^2\right),$$

which implies that $\theta^{mT} \operatorname{diag}(a)^{-1} e \overset{a.s.}{=} 0$. From
this it follows that if we define two nonnegative row vectors
$b_1 := b\,[a_1^{-1}, 0, \ldots, 0]$ and
$b_2 := b\,[(1-\alpha) a_1^{-1}, \alpha c_2 a_2^{-1}, \ldots, \alpha c_K a_K^{-1}]$, where
$b > 0$, $0 < \alpha < 1$ are constants, then $b_1 \theta^m \overset{a.s.}{=} b_2 \theta^m$
for any distribution on $\theta^m$.

Now we construct two separable topic matrices $\beta^{(1)}$ and
$\beta^{(2)}$ as follows. Let $b_1$ be the first row and $b_2$ be the second
in $\beta^{(1)}$. Let $b_2$ be the first row and $b_1$ the second in $\beta^{(2)}$.
Let $B \in \mathbb{R}^{(W-2) \times K}$ be a valid separable topic matrix. Set
the remaining (W − 2) rows of both $\beta^{(1)}$ and $\beta^{(2)}$ to be
$B(I_K - \operatorname{diag}(b_1 + b_2))$. We can choose $b$ to be small enough
to ensure that each element of $(b_1 + b_2)$ is strictly less than
1. This will ensure that $\beta^{(1)}$ and $\beta^{(2)}$ are column-stochastic
and therefore valid separable topic matrices. Observe that $b_2$
has at least two nonzero components. Thus, word 1 is novel
for $\beta^{(1)}$ but non-novel for $\beta^{(2)}$.

By construction, $\beta^{(1)}\theta \overset{a.s.}{=} \beta^{(2)}\theta$, i.e., the distribution of $X$
conditioned on $\theta$ is the same for both models. Marginalizing
over $\theta$, the distribution of $X$ under each topic matrix is the
same. Thus no algorithm can consistently distinguish between
$\beta^{(1)}$ and $\beta^{(2)}$ based on $X$. □
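The construction in this proof can be checked numerically on a small case. The sketch below is only an illustration under assumed values: K = 3, a topic-weight vector supported on two hypothetical points, and hand-picked constants `b`, `alpha`, and `B`. None of these numbers come from the paper.

```python
import numpy as np

# Hypothetical prior: theta takes one of two values (columns) with equal probability.
support = np.array([[0.5, 0.5],
                    [0.5, 0.0],
                    [0.0, 0.5]])          # K x 2, K = 3
probs = np.array([0.5, 0.5])

a = support @ probs                        # E[theta] = (0.5, 0.25, 0.25)
R = (support * probs) @ support.T          # E[theta theta^T]
R_bar = np.diag(1 / a) @ R @ np.diag(1 / a)

# Row 1 of R_bar equals 0.5*row 2 + 0.5*row 3, so R_bar is non-simplicial.
c = np.array([0.5, 0.5])
e = np.concatenate(([-1.0], c))
residual = e @ R_bar                       # should be the zero vector

# The two row vectors b_1, b_2 from the proof (b and alpha chosen arbitrarily).
b, alpha = 0.1, 0.5
b1 = b * np.array([1 / a[0], 0.0, 0.0])
b2 = b * np.concatenate(([(1 - alpha) / a[0]], alpha * c / a[1:]))

# Any separable, column-stochastic (W-2) x K matrix works; this one is made up.
B = np.array([[0.6, 0.0, 0.0],
              [0.0, 0.6, 0.0],
              [0.0, 0.0, 0.6],
              [0.4, 0.4, 0.4]])
rest = B @ (np.eye(3) - np.diag(b1 + b2))

beta1 = np.vstack([b1, b2, rest])          # word 1 is novel here
beta2 = np.vstack([b2, b1, rest])          # word 1 is not novel here

# Both matrices induce the same word distributions on the support of theta.
same_distribution = np.allclose(beta1 @ support, beta2 @ support)
print(same_distribution, beta1.sum(axis=0), beta2.sum(axis=0))
```

Because the two topic matrices give identical document-level word distributions for every value that the topic weights can take, no estimator based on the documents alone could tell them apart in this toy setting.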


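B. Proof of Lemma 2

Proof. The proof is by contradiction. Suppose that $\bar{R}$ is not
affine-independent. Then there exists a $\lambda \neq 0$ with $\mathbf{1}^T \lambda = 0$
such that $\lambda^T \bar{R} = 0$ so that $\lambda^T \bar{R} \lambda = 0$. Recalling that
$\bar{R} = \operatorname{diag}(a)^{-1} R \operatorname{diag}(a)^{-1}$, we have,

$$0 = \lambda^T \bar{R} \lambda = (\operatorname{diag}(a)^{-1} \lambda)^T E(\theta^m \theta^{mT}) (\operatorname{diag}(a)^{-1} \lambda) = E\left(\|\theta^{mT} \operatorname{diag}(a)^{-1} \lambda\|^2\right),$$

which implies that $\theta^{mT} \operatorname{diag}(a)^{-1} \lambda \overset{a.s.}{=} 0$. Since $\lambda \neq 0$,
we can assume, without loss of generality, that the first
$t$ elements of $\lambda$ satisfy $\lambda_1, \ldots, \lambda_t > 0$, the next $s$ elements
satisfy $\lambda_{t+1}, \ldots, \lambda_{t+s} < 0$, and the remaining elements
are 0, for some $s, t$ with $s > 0$, $t > 0$, $s + t \leq K$.
Therefore, if we define two nonnegative and nonzero row
vectors $b_1 := b\,[\lambda_1 a_1^{-1}, \ldots, \lambda_t a_t^{-1}, 0, \ldots, 0]$ and
$b_2 := -b\,[0, \ldots, 0, \lambda_{t+1} a_{t+1}^{-1}, \ldots, \lambda_{t+s} a_{t+s}^{-1}, 0, \ldots, 0]$, where $b > 0$ is
a constant, then $b_1 \theta^m \overset{a.s.}{=} b_2 \theta^m$.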

Now we construct two topic matrices $\beta^{(1)}$ and $\beta^{(2)}$ as
follows. Let $b_1$ be the first row and $b_2$ the second in $\beta^{(1)}$. Let
$b_2$ be the first row and $b_1$ the second in $\beta^{(2)}$. Let $B \in \mathbb{R}^{(W-2) \times K}$
be a valid topic matrix and assume that it is separable. Set
the remaining (W − 2) rows of both $\beta^{(1)}$ and $\beta^{(2)}$ to be
$B(I_K - \operatorname{diag}(b_1 + b_2))$. We can choose $b$ to be small enough
to ensure that each element of $(b_1 + b_2)$ is strictly less than 1.
This will ensure that $\beta^{(1)}$ and $\beta^{(2)}$ are column-stochastic and
therefore valid topic matrices. We note that the supports of $b_1$
and $b_2$ are disjoint and both are non-empty. They appear in
distinct topics.

By construction, $\beta^{(1)}\theta \overset{a.s.}{=} \beta^{(2)}\theta$, so the distribution of
the observation $X$ conditioned on $\theta$ is the same for both
models. Marginalizing over $\theta$, the distributions of $X$ under the
topic matrices are the same. Thus no algorithm can distinguish
between $\beta^{(1)}$ and $\beta^{(2)}$ based on $X$. □


C. Proof of Proposition 1 and Proposition 2

Proposition 1 and Proposition 2 summarize the relationships
between the full-rank, affine-independence, simplicial,
and diagonal-dominance conditions. Here we consider all the
pairwise implications separately.

(1) $\bar{R}$ is $\gamma_a$-affine-independent $\Rightarrow$ $\bar{R}$ is at least $\gamma_a$-simplicial.

Proof. By definition of affine independence, $\|\sum_{k=1}^{K} \lambda_k \bar{R}_k\|_2
\geq \gamma_a \|\lambda\|_2 > 0$ for all $\lambda \in \mathbb{R}^K$ such that $\sum_{k=1}^{K} \lambda_k = 0$ and
$\lambda \neq 0$. If for each $i \in [K]$ we set $\lambda_k = 1$ for $k = i$ and
choose $\lambda_k \leq 0$, $\forall k \neq i$, then (i) $\|\lambda\|_2 \geq 1$, (ii) $\{-\lambda_k, k \neq i\}$
are convex weights, i.e., they are nonnegative and sum to 1,
and (iii) $\sum_{k=1}^{K} \lambda_k \bar{R}_k = \bar{R}_i - \sum_{k \neq i} (-\lambda_k) \bar{R}_k$. Therefore, for
all $i \in [K]$, $\|\bar{R}_i - \sum_{k \neq i} (-\lambda_k) \bar{R}_k\|_2 \geq \gamma_a > 0$, which proves
that $\bar{R}$ is at least $\gamma_a$-simplicial.

For the reverse implication, consider

$$\bar{R} = \begin{bmatrix} 1 & 0 & 0.5 & 0.5 \\ 0 & 1 & 0.5 & 0.5 \\ 0.5 & 0.5 & 1 & 0 \\ 0.5 & 0.5 & 0 & 1 \end{bmatrix}$$

It is simplicial but is not affine independent (the 1, 1, −1, −1
combination of the 4 rows would be 0). □


(2) $\bar{R}$ is full rank with minimum eigenvalue $\gamma_r$ $\Rightarrow$ $\bar{R}$ is at
least $\gamma_r$-affine-independent.

Proof. The Rayleigh-quotient characterization of the minimum
eigenvalue of a symmetric, positive-definite matrix
$\bar{R}$ gives $\min_{\lambda \neq 0} \|\lambda^T \bar{R}\|_2 / \|\lambda\|_2 = \gamma_r > 0$. Therefore,
$\min_{\lambda \neq 0,\, \mathbf{1}^T \lambda = 0} \|\lambda^T \bar{R}\|_2 / \|\lambda\|_2 \geq \gamma_r > 0$. One can construct
examples that contradict the reverse implication:

$$\bar{R} = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 2 \end{bmatrix}$$

which is affine independent, but not linearly independent. □


(3) $\bar{R}$ is $\gamma_d$-diagonal-dominant $\Rightarrow$ $\bar{R}$ is at least $\gamma_d$-simplicial.

Proof. Noting that $\bar{R}_{i,i} - \bar{R}_{i,j} \geq \gamma_d > 0$ for all $i \neq j$,
the distance of the first row of $\bar{R}$, $\bar{R}_1$, to any convex
combination of the remaining rows, $\sum_{j=2}^{K} c_j \bar{R}_j$, where
$c_2, \ldots, c_K$ are convex combination weights, can be lower
bounded by

$$\Big\|\bar{R}_1 - \sum_{j=2}^{K} c_j \bar{R}_j\Big\|_2 \geq \Big|\bar{R}_{1,1} - \sum_{j=2}^{K} c_j \bar{R}_{j,1}\Big| = \Big|\sum_{j=2}^{K} c_j (\bar{R}_{1,1} - \bar{R}_{j,1})\Big| \geq \gamma_d > 0.$$

Therefore, $\bar{R}$ is at least $\gamma_d$-simplicial. It is straightforward to
construct examples that contradict the reverse implication:

$$\bar{R} = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 2 \end{bmatrix}$$

which is affine independent, hence simplicial, but not
diagonal-dominant. □
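The two counterexample matrices used in these proofs can be checked numerically. The sketch below is illustrative only: the helper names (`affine_gap`, `diag_dominance_gap`, `simplicial_gap`) are hypothetical, and the distance to a convex hull is approximated with projected gradient descent, which is a swapped-in generic technique and not part of the paper.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def dist_to_hull(point, vertices, iters=5000):
    """Approximate distance from `point` to the convex hull of the rows of `vertices`."""
    c = np.full(len(vertices), 1.0 / len(vertices))
    step = 1.0 / (2.0 * np.linalg.norm(vertices @ vertices.T, 2))
    for _ in range(iters):
        grad = 2.0 * (c @ vertices - point) @ vertices.T
        c = project_to_simplex(c - step * grad)
    return np.linalg.norm(c @ vertices - point)

def simplicial_gap(R):
    """Smallest distance from a row to the convex hull of the other rows."""
    return min(dist_to_hull(R[i], np.delete(R, i, axis=0)) for i in range(len(R)))

def affine_gap(R):
    """min ||lambda^T R|| over unit-norm lambda with entries summing to zero."""
    K = len(R)
    _, _, vt = np.linalg.svd(np.ones((1, K)))
    Q = vt[1:].T                      # orthonormal basis of {lambda : 1^T lambda = 0}
    return np.linalg.svd(Q.T @ R, compute_uv=False).min()

def diag_dominance_gap(R):
    """min over i != j of R[i, i] - R[i, j]."""
    off = ~np.eye(len(R), dtype=bool)
    return (np.diag(R)[:, None] - R)[off].min()

R3 = np.array([[1.0, 0.0, 1.0],
               [0.0, 1.0, 1.0],
               [1.0, 1.0, 2.0]])
R4 = np.array([[1.0, 0.0, 0.5, 0.5],
               [0.0, 1.0, 0.5, 0.5],
               [0.5, 0.5, 1.0, 0.0],
               [0.5, 0.5, 0.0, 1.0]])

for name, R in (("3x3", R3), ("4x4", R4)):
    print(name, np.linalg.matrix_rank(R), affine_gap(R),
          simplicial_gap(R), diag_dominance_gap(R))
```

On these inputs the 3 × 3 matrix is rank deficient yet has a positive affine-independence gap and a zero diagonal-dominance gap, while the 4 × 4 matrix has a positive simplicial gap and diagonal-dominance gap but a vanishing affine-independence gap.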


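(4) $\bar{R}$ being diagonal-dominant neither implies nor is implied
by $\bar{R}$ being affine-independent.

Proof. Consider the following two examples:

$$\bar{R} = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 2 \end{bmatrix}
\quad \text{and} \quad
\begin{bmatrix} 1 & 0 & 0.5 & 0.5 \\ 0 & 1 & 0.5 & 0.5 \\ 0.5 & 0.5 & 1 & 0 \\ 0.5 & 0.5 & 0 & 1 \end{bmatrix}$$

They are the examples for the two sides of this assertion: the
first is affine-independent but not diagonal-dominant, and the
second is diagonal-dominant but not affine-independent. □

D. Proof of Lemma 3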

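Proof. Recall that $\bar{A} = \bar{\beta}\bar{\theta}$ where $\bar{A}$ and $\bar{\theta}$ are row-normalized
versions of $A$ and $\theta$, and $\bar{\beta} := \operatorname{diag}(A\mathbf{1})^{-1} \beta \operatorname{diag}(\theta\mathbf{1})$. $\bar{\beta}$ is
row-stochastic and is separable if $\beta$ is separable. If $w$ is a novel
word of topic $k$, $\bar{\beta}_{wk} = 1$ and $\bar{\beta}_{wj} = 0$, $\forall j \neq k$. We have
then $\bar{A}_w = \bar{\theta}_k$. If $w$ is a non-novel word, $\bar{A}_w = \sum_k \bar{\beta}_{wk} \bar{\theta}_k$
is a convex combination of the rows of $\bar{\theta}$.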

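We next prove that if $\bar{R}$ is $\gamma_s$-simplicial with some constant
$\gamma_s > 0$, then the random matrix $\bar{\theta}$ is also simplicial with high
probability, i.e., for any $c \in \mathbb{R}^K$ such that $c_k = 1$, $c_j \leq 0$ for $j \neq k$,
$\sum_{j \neq k} (-c_j) = 1$, $k \in [K]$, the M-dimensional vector $c^T \bar{\theta}$ is
not all-zero with high probability. In other words, we need
to show that the maximum absolute value of the M entries
in $c^T \bar{\theta}$ is strictly positive. Noting that the m-th entry of $c^T \bar{\theta}$
(scaled by M) is

$$M c^T \bar{\theta}^m = c^T \operatorname{diag}(a)^{-1} \theta^m + c^T \Big(\operatorname{diag}\big(\textstyle\sum_d \theta^d / M\big)^{-1} - \operatorname{diag}(a)^{-1}\Big) \theta^m,$$

the absolute value can be lower bounded as follows,

$$|M c^T \bar{\theta}^m| \geq |c^T \operatorname{diag}(a)^{-1} \theta^m| - \Big|c^T \Big(\operatorname{diag}\big(\textstyle\sum_d \theta^d / M\big)^{-1} - \operatorname{diag}(a)^{-1}\Big) \theta^m\Big| \tag{9}$$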

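The key ideas are: (i) as M increases, the second term in
Eq. (9) converges to 0, and (ii) the maximum of the first
term in Eq. (9) among m = 1,..., M is strictly above zero
with high probability. For (i), recall that $a = E(\theta^m)$ and
$0 \leq \theta_k^m \leq 1$; by Hoeffding's lemma, $\forall t > 0$,

$$\Pr\Big(\big\|\textstyle\sum_d \theta^d / M - a\big\|_\infty \geq t\Big) \leq 2K \exp(-2Mt^2)$$

Also note that $\forall\, 0 < \epsilon < 1$,

$$\big\|\textstyle\sum_d \theta^d / M - a\big\|_\infty \leq \epsilon\, a_{\min}^2 / 2$$
$$\Rightarrow \Big\|\operatorname{diag}\big(\textstyle\sum_d \theta^d / M\big)^{-1} - \operatorname{diag}(a)^{-1}\Big\|_\infty \leq \epsilon$$
$$\Rightarrow \Big|c^T \Big(\operatorname{diag}\big(\textstyle\sum_d \theta^d / M\big)^{-1} - \operatorname{diag}(a)^{-1}\Big) \theta^m\Big| \leq \epsilon$$

where $a_{\min}$ is the minimum entry of $a$. The last inequality is
true since $\sum_{k=1}^{K} \theta_k^m = 1$. In sum, we have

$$\Pr\Big(\Big|c^T \Big(\operatorname{diag}\big(\textstyle\sum_d \theta^d / M\big)^{-1} - \operatorname{diag}(a)^{-1}\Big) \theta^m\Big| > \epsilon\Big) \leq 2K \exp(-M \epsilon^2 a_{\min}^4 / 2) \tag{10}$$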

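For (ii), recall that $\bar{R}$ is $\gamma_s$-simplicial and $\|c^T \bar{R}\| \geq \gamma_s$.
Therefore, $c^T \bar{R} c = c^T \bar{R} \bar{R}^{\dagger} \bar{R} c \geq \gamma_s^2 / \lambda_{\max}$, where $\lambda_{\max}$
is the maximum singular value of $\bar{R}$. Noting that
$\bar{R} = \operatorname{diag}(a)^{-1} E(\theta^m \theta^{mT}) \operatorname{diag}(a)^{-1}$, we get

$$E\big(|c^T \operatorname{diag}(a)^{-1} \theta^m|^2\big) \geq \frac{\gamma_s^2}{\lambda_{\max}} \tag{11}$$

For convenience, let $x_m := |c^T \operatorname{diag}(a)^{-1} \theta^m|^2 \leq 1 / a_{\min}^2$.
Then, by Hoeffding's lemma,

$$\Pr\Big(E(x_m) - \sum_{m=1}^{M} x_m / M \geq \frac{\gamma_s^2}{2\lambda_{\max}}\Big) \leq \exp\big(-M \gamma_s^4 a_{\min}^4 / 2\lambda_{\max}^2\big)$$

Combining this with Eq. (11) we get

$$\Pr\Big(\sum_{m=1}^{M} x_m / M \leq \frac{\gamma_s^2}{2\lambda_{\max}}\Big) \leq \exp\big(-M \gamma_s^4 a_{\min}^4 / 2\lambda_{\max}^2\big)$$

Hence

$$\Pr\Big(\max_{m=1,\ldots,M} x_m \leq \frac{\gamma_s^2}{2\lambda_{\max}}\Big) \leq \exp\big(-M \gamma_s^4 a_{\min}^4 / 2\lambda_{\max}^2\big) \tag{12}$$

i.e., the maximum absolute value of the first term in Eq. (9)
is greater than $\gamma_s / \sqrt{2\lambda_{\max}}$ with high probability.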

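To sum up, if we set $\epsilon = \gamma_s / \sqrt{2\lambda_{\max}}$ in Eq. (10), we get

$$\Pr\Big(\max_{m=1,\ldots,M} |c^T \bar{\theta}^m| = 0\Big) \leq \Pr\Big(\max_{m=1,\ldots,M} x_m \leq \frac{\gamma_s^2}{2\lambda_{\max}}\Big) + \Pr\Big(\Big|c^T \Big(\operatorname{diag}\big(\textstyle\sum_d \theta^d / M\big)^{-1} - \operatorname{diag}(a)^{-1}\Big) \theta^m\Big| > \epsilon\Big)$$
$$\leq \exp\big(-M \gamma_s^4 a_{\min}^4 / 2\lambda_{\max}^2\big) + 2K \exp\big(-M \gamma_s^2 a_{\min}^4 / 4\lambda_{\max}\big)$$

To summarize, the probability that $\bar{\theta}$ is not simplicial is at
most $\exp(-M \gamma_s^4 a_{\min}^4 / 2\lambda_{\max}^2) + 2K \exp(-M \gamma_s^2 a_{\min}^4 / 4\lambda_{\max})$.
This converges to 0 exponentially fast as $M \to \infty$. Therefore,
with high probability, all the row-vectors of $\bar{\theta}$ are extreme
points of the convex hull they form and this concludes our
proof. □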
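The concentration step (i) used in the proof above, a Hoeffding bound combined with a union bound over the K coordinates, can be sanity-checked by simulation. The sketch below assumes a symmetric Dirichlet prior and arbitrary values of K, M, and t purely for illustration; these choices are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

K, M, t, trials = 3, 200, 0.1, 2000      # hypothetical settings
alpha = np.ones(K)                        # assumed Dirichlet prior on theta
a = alpha / alpha.sum()                   # a = E[theta]

# theta has shape (trials, M, K): M topic-weight vectors per trial.
theta = rng.dirichlet(alpha, size=(trials, M))
deviation = np.abs(theta.mean(axis=1) - a).max(axis=1)   # sup-norm error per trial

empirical = np.mean(deviation >= t)
bound = 2 * K * np.exp(-2 * M * t**2)     # 2K exp(-2 M t^2)
print(empirical, bound)
```

The empirical frequency should fall below the bound, which is typically loose because it ignores the actual variance of the topic weights.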


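E. Proof of Lemma 4

Proof. We first show that if $\bar{R}$ is $\gamma_a$ affine-independent, $\bar{\theta}$ is
also affine-independent with high probability, i.e., $\forall c \in \mathbb{R}^K$
such that $c \neq 0$, $\sum_k c_k = 0$, $c^T \bar{\theta}$ is not an all-zero vector with
high probability. Our proof is similar to that of Lemma 3. We
first rewrite the m-th entry of $c^T \bar{\theta}$ (with some scaling) as,

$$M c^T \bar{\theta}^m = c^T \operatorname{diag}(a)^{-1} \theta^m + c^T \Big(\operatorname{diag}\big(\textstyle\sum_d \theta^d / M\big)^{-1} - \operatorname{diag}(a)^{-1}\Big) \theta^m$$

and lower bound its absolute value by

$$|M c^T \bar{\theta}^m| \geq |c^T \operatorname{diag}(a)^{-1} \theta^m| - \Big|c^T \Big(\operatorname{diag}\big(\textstyle\sum_d \theta^d / M\big)^{-1} - \operatorname{diag}(a)^{-1}\Big) \theta^m\Big| \tag{13}$$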

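We will then show that: (i) as M increases, the second term in
Eq. (13) converges to 0, and (ii) the maximum of the first term
in Eq. (13) among M iid samples is strictly above zero with
high probability. For (i), by the Cauchy-Schwarz inequality

$$\Big|c^T \Big(\operatorname{diag}\big(\textstyle\sum_d \theta^d / M\big)^{-1} - \operatorname{diag}(a)^{-1}\Big) \theta^m\Big|
\leq \|c\|_2 \Big\|\Big(\operatorname{diag}\big(\textstyle\sum_d \theta^d / M\big)^{-1} - \operatorname{diag}(a)^{-1}\Big) \theta^m\Big\|_2
\leq \|c\|_2 \Big\|\operatorname{diag}\big(\textstyle\sum_d \theta^d / M\big)^{-1} - \operatorname{diag}(a)^{-1}\Big\|_\infty$$

Here the last inequality is true since $\theta_k^m \leq 1$, $\sum_k \theta_k^m = 1$.
Similar to Eq. (10), we have,

$$\Pr\Big(\Big|c^T \Big(\operatorname{diag}\big(\textstyle\sum_d \theta^d / M\big)^{-1} - \operatorname{diag}(a)^{-1}\Big) \theta^m\Big| > \|c\|_2\, \epsilon\Big) \leq 2K \exp(-M \epsilon^2 a_{\min}^4 / 4) \tag{14}$$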

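for any $0 < \epsilon < 1$, where $a_{\min}$ is the minimum entry of $a$. For (ii),
recall that by definition, $\|c^T \bar{R}\|_2 \geq \gamma_a \|c\|_2$. Hence $c^T \bar{R} c \geq
\gamma_a^2 \|c\|_2^2 / \lambda_{\max}$. Therefore, by the construction of $\bar{R}$, we have,

$$E\big(|c^T \operatorname{diag}(a)^{-1} \theta^m|^2 / \|c\|_2^2\big) \geq \frac{\gamma_a^2}{\lambda_{\max}} \tag{15}$$

For convenience, let $x_m := |c^T \operatorname{diag}(a)^{-1} \theta^m|^2 / \|c\|_2^2 \leq
1 / a_{\min}^2$. Following the same procedure as in Eq. (12), we have,

$$\Pr\Big(\max_{m=1,\ldots,M} x_m \leq \frac{\gamma_a^2}{2\lambda_{\max}}\Big) \leq \exp\big(-M \gamma_a^4 a_{\min}^4 / 2\lambda_{\max}^2\big) \tag{16}$$

Therefore, if we set $\epsilon = \gamma_a / \sqrt{2\lambda_{\max}}$ in Eq. (14), we get,

$$\Pr\Big(\max_{m=1,\ldots,M} |c^T \bar{\theta}^m| \leq 0\Big) \leq \exp\big(-M \gamma_a^4 a_{\min}^4 / 2\lambda_{\max}^2\big) + 2K \exp\big(-M \gamma_a^2 a_{\min}^4 / 4\lambda_{\max}\big)$$

In summary, if $\bar{R}$ is $\gamma_a$ affine-independent, $\bar{\theta}$ is also
affine-independent with high probability.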

Now we turn to prove Lemma 4. By Lemma 3, detecting
K distinct novel words for K topics is equivalent to knowing
$\bar{\theta}$ up to a row permutation. Noting that $\bar{A}_w = \sum_k \bar{\beta}_{wk} \bar{\theta}_k$, it
follows that $\bar{\beta}_{wk}$, k = 1,..., K, is one optimal solution to the
following constrained optimization problem:

$$\min_{b_1, \ldots, b_K} \Big\|\bar{A}_w - \sum_{k=1}^{K} b_k \bar{\theta}_k\Big\|_2 \quad \text{s.t.} \quad b_k \geq 0, \; \sum_{k=1}^{K} b_k = 1$$

Since $\bar{\theta}$ is affine-independent with high probability,
this optimal solution is unique with high probability. If
this is not true, then there would exist two distinct solutions
$b_1^1, \ldots, b_K^1$ and $b_1^2, \ldots, b_K^2$ such that $\bar{A}_w = \sum_{k=1}^{K} b_k^1 \bar{\theta}_k =
\sum_{k=1}^{K} b_k^2 \bar{\theta}_k$ and $\sum_k b_k^1 = \sum_k b_k^2 = 1$. We would then obtain

$$\sum_{k=1}^{K} (b_k^1 - b_k^2)\, \bar{\theta}_k = 0$$

where the coefficients $b_k^1 - b_k^2$ are not all zero and $\sum_k (b_k^1 - b_k^2) =
0$. This would contradict the affine-independence definition.

Finally, we check the renormalization steps. Recall that
since $\operatorname{diag}(A\mathbf{1})\bar{\beta} = \beta \operatorname{diag}(\theta\mathbf{1})$, $\operatorname{diag}(A\mathbf{1})$ can be directly
obtained from the observations. So we can first renormalize
the rows of $\bar{\beta}$. Removing $\operatorname{diag}(\theta\mathbf{1})$ is then simply a column
renormalization operation (recall that $\beta$ is column-stochastic).
It is not necessary to know the exact value of $\operatorname{diag}(\theta\mathbf{1})$.

To sum up, by solving a constrained linear regression
followed by suitable row renormalization, we can obtain a
unique solution which is the ground truth topic matrix. This
concludes the proof of Lemma 4. □
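The constrained regression in this proof can be illustrated on a tiny noiseless instance. In the sketch below the matrix `theta_bar`, the weights `b_true`, and the solver (projected gradient descent onto the probability simplex) are all assumptions made for illustration; the paper does not prescribe this particular solver.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {b : b >= 0, sum(b) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def simplex_regression(target, rows, iters=500):
    """Minimize ||target - b @ rows||^2 over the simplex by projected gradient."""
    b = np.full(len(rows), 1.0 / len(rows))
    step = 1.0 / (2.0 * np.linalg.norm(rows @ rows.T, 2))
    for _ in range(iters):
        grad = 2.0 * (b @ rows - target) @ rows.T
        b = project_to_simplex(b - step * grad)
    return b

# Hypothetical row-normalized topic weights (K = 3 rows, M = 4 documents);
# the rows are affinely independent, so the solution is unique.
theta_bar = np.array([[0.7, 0.1, 0.1, 0.1],
                      [0.1, 0.7, 0.1, 0.1],
                      [0.1, 0.1, 0.7, 0.1]])
b_true = np.array([0.2, 0.5, 0.3])        # plays the role of one row of beta-bar
A_w = b_true @ theta_bar                   # noiseless row of A-bar for word w

b_hat = simplex_regression(A_w, theta_bar)
print(b_hat)
```

With noiseless inputs and affinely independent rows, the recovered weights coincide with `b_true` up to numerical precision, which mirrors the uniqueness argument in the proof.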

F. Proof of Lemma 5

Lemma 0] establishes the second order co-occurrence es¬ 
timator in Eq. 0. We first provide a generic method to 
establish the explicit convergence bound for a function $\psi(X)$
of d random variables $X_1, \ldots, X_d$, then apply it to establish
Lemma 5.


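Proposition 3. Let $X = [X_1, \ldots, X_d]$ be d random variables
and $a = [a_1, \ldots, a_d]$ be positive constants. Let
$\mathcal{E} := \bigcup_{i \in \mathcal{I}} \{|X_i - a_i| \geq \delta_i\}$ for some constants $\delta_i > 0$, and $\psi(X)$
be a continuously differentiable function in $\mathcal{C} := \mathcal{E}^c$. If for
i = 1,..., d, $\Pr(|X_i - a_i| \geq \epsilon) \leq f_i(\epsilon)$ are the individual
convergence rates and $\max_{X \in \mathcal{C}} |\partial_i \psi(X)| \leq C_i$, then,

$$\Pr(|\psi(X) - \psi(a)| \geq \epsilon) \leq \sum_{i \in \mathcal{I}} f_i(\delta_i) + \sum_{i=1}^{d} f_i\Big(\frac{\epsilon}{d\, C_i}\Big)$$

Proof. Since $\psi(X)$ is continuously differentiable in $\mathcal{C}$, $\forall X \in
\mathcal{C}$, $\exists \lambda \in (0, 1)$ such that

$$\psi(X) - \psi(a) = \nabla^T \psi((1 - \lambda)a + \lambda X) \cdot (X - a)$$

Therefore,

$$\Pr(|\psi(X) - \psi(a)| \geq \epsilon)
\leq \Pr(X \in \mathcal{E}) + \Pr\Big(\sum_{i=1}^{d} |\partial_i \psi((1 - \lambda)a + \lambda X)|\, |X_i - a_i| \geq \epsilon \;\Big|\; X \in \mathcal{C}\Big)$$
$$\leq \sum_{i \in \mathcal{I}} \Pr(|X_i - a_i| \geq \delta_i) + \sum_{i=1}^{d} \Pr\Big(\max_{x \in \mathcal{C}} |\partial_i \psi(x)|\, |X_i - a_i| \geq \epsilon / d\Big)$$
$$\leq \sum_{i \in \mathcal{I}} f_i(\delta_i) + \sum_{i=1}^{d} f_i\Big(\frac{\epsilon}{d\, C_i}\Big)$$

□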

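Now we turn to prove Lemma 5. Recall that $\bar{X}$ and
$\bar{X}'$ are obtained from $X$ by first splitting each document's
words into two independent halves and then rescaling
the rows to make them row-stochastic, hence $\bar{X} =
\operatorname{diag}^{-1}(X\mathbf{1})X$. Also recall that $\bar{\beta} = \operatorname{diag}^{-1}(\beta a)\, \beta \operatorname{diag}(a)$,
$\bar{R} = \operatorname{diag}^{-1}(a) R \operatorname{diag}^{-1}(a)$, and $\bar{\beta}$ is row stochastic. For
any $1 \leq i, j \leq W$,

$$\hat{E}_{i,j} = M \cdot \frac{1}{\sum_{m=1}^{M} X'_{i,m}} \Big(\sum_{m=1}^{M} X'_{i,m} X_{j,m}\Big) \frac{1}{\sum_{m=1}^{M} X_{j,m}}
= \frac{\frac{1}{M} \sum_{m=1}^{M} X'_{i,m} X_{j,m}}{\big(\frac{1}{M} \sum_{m=1}^{M} X'_{i,m}\big)\big(\frac{1}{M} \sum_{m=1}^{M} X_{j,m}\big)}$$

$$= \frac{\frac{1}{MN^2} \sum_{m=1, n=1, n'=1}^{M, N, N} \mathbb{I}(w'_{m,n} = i)\, \mathbb{I}(w_{m,n'} = j)}{\frac{1}{MN} \sum_{m=1, n=1}^{M, N} \mathbb{I}(w'_{m,n} = i) \cdot \frac{1}{MN} \sum_{m=1, n=1}^{M, N} \mathbb{I}(w_{m,n} = j)}
:= \frac{F_{i,j}(M, N)}{G_i(M, N)\, H_j(M, N)}$$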

From the Strong Law of Large Numbers and the generative
topic modeling procedure,

$$F_{i,j}(M, N) \to E\big(\mathbb{I}(w'_{m,n} = i)\, \mathbb{I}(w_{m,n'} = j)\big) = (\beta R \beta^T)_{i,j} := p_{i,j}$$
$$G_i(M, N) \to E\big(\mathbb{I}(w'_{m,n} = i)\big) = (\beta a)_i := p_i$$
$$H_j(M, N) \to E\big(\mathbb{I}(w_{m,n} = j)\big) = (\beta a)_j := p_j$$

and $\frac{(\beta R \beta^T)_{i,j}}{(\beta a)_i (\beta a)_j} = E_{i,j}$ by definition. Using McDiarmid's
inequality, we obtain

$$\Pr(|F_{i,j} - p_{i,j}| \geq \epsilon) \leq 2 \exp(-\epsilon^2 MN)$$
$$\Pr(|G_i - p_i| \geq \epsilon) \leq 2 \exp(-2\epsilon^2 MN)$$
$$\Pr(|H_j - p_j| \geq \epsilon) \leq 2 \exp(-2\epsilon^2 MN)$$


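In order to calculate $\Pr\{|\frac{F_{i,j}}{G_i H_j} - \frac{p_{i,j}}{p_i p_j}| \geq \epsilon\}$, we apply the
results from Proposition 3. Let $\psi(x_1, x_2, x_3) = \frac{x_1}{x_2 x_3}$ with
$x_1, x_2, x_3 > 0$, and $a_1 = p_{i,j}$, $a_2 = p_i$, $a_3 = p_j$. Let $\mathcal{I} =
\{2, 3\}$, $\delta_2 = \gamma p_i$ and $\delta_3 = \gamma p_j$. Then $|\partial_1 \psi| = \frac{1}{x_2 x_3}$,
$|\partial_2 \psi| = \frac{x_1}{x_2^2 x_3}$, and $|\partial_3 \psi| = \frac{x_1}{x_2 x_3^2}$. If $F_{i,j} = x_1$, $G_i = x_2$, and $H_j = x_3$,
then $F_{i,j} \leq G_i$, $F_{i,j} \leq H_j$. Then note that

$$C_1 = \max_{\mathcal{C}} |\partial_1 \psi| = \max_{\mathcal{C}} \frac{1}{G_i H_j} \leq \frac{1}{(1 - \gamma)^2 p_i p_j}$$
$$C_2 = \max_{\mathcal{C}} |\partial_2 \psi| = \max_{\mathcal{C}} \frac{F_{i,j}}{G_i^2 H_j} \leq \max_{\mathcal{C}} \frac{1}{G_i H_j} \leq \frac{1}{(1 - \gamma)^2 p_i p_j}$$
$$C_3 = \max_{\mathcal{C}} |\partial_3 \psi| = \max_{\mathcal{C}} \frac{F_{i,j}}{G_i H_j^2} \leq \max_{\mathcal{C}} \frac{1}{G_i H_j} \leq \frac{1}{(1 - \gamma)^2 p_i p_j}$$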


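By applying Proposition 3 we get

$$\Pr\Big\{\Big|\frac{F_{i,j}}{G_i H_j} - \frac{p_{i,j}}{p_i p_j}\Big| \geq \epsilon\Big\}
\leq \exp(-2\gamma^2 p_i^2 MN) + \exp(-2\gamma^2 p_j^2 MN)$$
$$\qquad + 2 \exp\big(-\epsilon^2 (1 - \gamma)^4 (p_i p_j)^2 MN / 9\big) + 4 \exp\big(-2\epsilon^2 (1 - \gamma)^4 (p_i p_j)^2 MN / 9\big)$$
$$\leq 2 \exp(-2\gamma^2 \eta^2 MN) + 6 \exp\big(-\epsilon^2 (1 - \gamma)^4 \eta^4 MN / 9\big)$$

where $\eta = \min_{1 \leq i \leq W} p_i$. There are many strategies for
optimizing the free parameter $\gamma$. We set $2\gamma^2 = \frac{(1 - \gamma)^4}{9}$ and
solve for $\gamma$ to obtain

$$\Pr\Big\{\Big|\frac{F_{i,j}}{G_i H_j} - \frac{p_{i,j}}{p_i p_j}\Big| \geq \epsilon\Big\} \leq 8 \exp(-\epsilon^2 \eta^4 MN / 20)$$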
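The constant 20 in the last bound can be checked with a short calculation. The sketch below assumes that the balancing condition reads $2\gamma^2 = (1-\gamma)^4/9$ (the extracted source is garbled at this point, so this reading is an assumption) and solves it by bisection.

```python
def balance_gap(g):
    """Difference between the two exponent coefficients, 2 g^2 - (1 - g)^4 / 9."""
    return 2.0 * g**2 - (1.0 - g) ** 4 / 9.0

# balance_gap is increasing on [0, 1], negative at 0 and positive at 1.
lo, hi = 0.0, 1.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if balance_gap(mid) < 0:
        lo = mid
    else:
        hi = mid

gamma = 0.5 * (lo + hi)
rate = 2.0 * gamma**2          # common coefficient after balancing
print(gamma, rate, 1.0 / 20.0)
```

Under this assumed condition the balanced coefficient is slightly larger than 1/20, which is consistent with replacing both exponents by $\epsilon^2 \eta^4 MN / 20$ when $\epsilon$ and $\eta$ are at most 1.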

Finally, by applying the union bound to the $W^2$ entries in $\hat{E}$,
we obtain the claimed result.
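The identity between the matrix form of the estimator and the ratio $F_{i,j}/(G_i H_j)$ used in this proof can be verified numerically. The sketch below uses synthetic word counts drawn from a single made-up word distribution `p`; the sizes W, M, N and the sampling scheme are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

W, M, N = 5, 30, 20                         # hypothetical sizes
p = np.array([0.30, 0.25, 0.20, 0.15, 0.10])

# Word-count matrices (W x M) for the two independent halves of each document.
X = rng.multinomial(N, p, size=M).T.astype(float)
X_prime = rng.multinomial(N, p, size=M).T.astype(float)

def row_normalize(counts):
    """Rescale each row to sum to one (rows are assumed to be nonzero)."""
    return counts / counts.sum(axis=1, keepdims=True)

# Matrix form: E_hat = M * X_bar' X_bar^T.
E_hat = M * row_normalize(X_prime) @ row_normalize(X).T

# Ratio form: F_ij / (G_i H_j).
F = X_prime @ X.T / (M * N**2)
G = X_prime.sum(axis=1) / (M * N)
H = X.sum(axis=1) / (M * N)
E_ratio = F / np.outer(G, H)

print(np.abs(E_hat - E_ratio).max())
```

Both forms agree entrywise; the ratio form is the one to which Proposition 3 is applied.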


G. Proof of Lemma 6

Proof. We first show that when $\bar{R}$ is $\gamma_s$-simplicial and $\beta$ is
separable, then $Y = \bar{R}\bar{\beta}^T$ is at least $\gamma_s$-simplicial. Without
loss of generality we assume that words 1,..., K are the novel
words for topics 1 to K. By definition, $\bar{\beta}^T = [I_K, B]$, hence
$Y = \bar{R}\bar{\beta}^T = [\bar{R}, \bar{R}B]$. Therefore, for convex combination
weights $c_2, \ldots, c_K \geq 0$ such that $\sum_{j=2}^{K} c_j = 1$,

$$\Big\|Y_1 - \sum_{j=2}^{K} c_j Y_j\Big\| \geq \Big\|\bar{R}_1 - \sum_{j=2}^{K} c_j \bar{R}_j\Big\| \geq \gamma_s > 0$$

Therefore the first row vector $Y_1$ is at least $\gamma_s$ distant
from the convex hull of the remaining rows. Similarly, any
row of $Y$ is at least $\gamma_s$ distant from the convex hull of
the remaining rows, hence $Y$ is at least $\gamma_s$-simplicial. The rest
of the proof will be exactly the same as for Lemma 6. □

H. Proof of Lemma 7

Proof. We first show that when $\bar{R}$ is $\gamma_a$ affine independent
and $\beta$ is separable, then $Y = \bar{R}\bar{\beta}^T$ is at least $\gamma_a$ affine
independent. Similarly as in the proof of Lemma 6, we assume
that words 1,..., K are the novel words for topics 1 to K.

By definition, $\bar{\beta}^T = [I_K, B]$, hence $Y = \bar{R}\bar{\beta}^T = [\bar{R}, \bar{R}B]$.
$\forall \lambda \in \mathbb{R}^K$ such that $\lambda \neq 0$, $\sum_k \lambda_k = 0$, then,

$$\Big\|\sum_{k=1}^{K} \lambda_k Y_k\Big\|_2 / \|\lambda\|_2 \geq \Big\|\sum_{k=1}^{K} \lambda_k \bar{R}_k\Big\|_2 / \|\lambda\|_2 \geq \gamma_a$$

Hence $Y$ is affine independent. The rest of the proof will
be exactly the same as that for Lemma 4.

We note that once the novel words for K topics are
detected, we can use only the corresponding columns of $E$
for linear regression. Formally, let $E^*$ be the W × K matrix
formed by the columns of $E$ that correspond to K distinct
novel words. Then, $E^* = \bar{\beta}\bar{R}$. The rest of the proof is again
the same as that for Lemma 4. □

I. Proof of Lemma 8

Proof. We first check that if $q_w > 0$, $w$ must be a novel word.
Without loss of generality let words 1,..., K be novel words
for K distinct topics. $\forall w$, $E_w = \sum_k \bar{\beta}_{wk} E_k$. $\forall d \in \mathbb{R}^W$,

$$\langle E_w, d \rangle = \sum_k \bar{\beta}_{wk} \langle E_k, d \rangle \leq \max_k \langle E_k, d \rangle$$

and the last equality holds if, and only if, there exists some $k$
such that $\bar{\beta}_{wk} = 1$, which implies that $w$ is a novel word.

We then show that for a novel word $w$, $q_w > 0$. We
need to show for each topic $k$, when $d$ is sampled from an
isotropic distribution in $\mathbb{R}^W$, there exists a set of directions
$d$ with nonzero probability such that $\langle E_k, d \rangle > \langle E_l, d \rangle$ for
$l = 1, \ldots, K$, $l \neq k$. First, one can check by definition that
$Y = (E_1^T, \ldots, E_K^T)^T = \bar{R}\bar{\beta}^T$ is at least $\gamma_s$-simplicial if
$\bar{R}$ is $\gamma_s$-simplicial. Let $E_1^*$ be the projection of $E_1$ onto the
simplex formed by the remaining row vectors $E_2, \ldots, E_K$.
By the orthogonality principle, $\langle E_1 - E_1^*, E_k - E_1^* \rangle \leq 0$ for
k = 2,..., K. Therefore, for $d_1 = E_1^T - E_1^{*T}$,

$$E_1 d_1 - E_k d_1 = \|d_1\|^2 - (E_k - E_1^*) d_1 \geq \gamma_s^2 > 0$$

Due to the continuity of the inner product, there exists a
neighborhood on the unit sphere around $d_1 / \|d_1\|_2$ in which $E_1$ has
the maximum projection value. This concludes our proof. □
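The statement proved here, that only extreme points receive a nonzero solid angle, can be illustrated by Monte Carlo. The sketch below is a toy setting with three made-up extreme points and one interior point; the variable names and the use of Gaussian directions as the isotropic distribution are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three extreme points (standing in for rows of novel words) and one interior
# point that is a strict convex combination of them.
extreme = np.eye(3)
interior = np.array([[0.5, 0.3, 0.2]]) @ extreme
points = np.vstack([extreme, interior])           # 4 x 3

P = 5000                                           # number of random projections
directions = rng.standard_normal((points.shape[1], P))   # isotropic Gaussian

# For each direction, record which point has the largest projection.
winners = np.argmax(points @ directions, axis=0)
q_hat = np.bincount(winners, minlength=len(points)) / P
print(q_hat)
```

The interior point never attains the maximum projection, so its estimated solid angle is zero, while each extreme point wins a positive fraction of the directions.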

J. Proof of Theorem [2] 

Proof. We first consider the random projection steps (steps 3 to
12 in Alg. 2). For projection along direction $d_r$, we first calculate
the projection values $v = M\bar{X}'\bar{X}^T d_r$, find the maximizer index
$i^*$ and the corresponding set $\hat{J}_{i^*}$, and then evaluate $\mathbb{I}(\forall j \in
\hat{J}_w, v_w > v_j)$ for all the words $w$ in $\hat{J}_{i^*}^c = \{1, \ldots, W\} \setminus \hat{J}_{i^*}$.
(I) The set $\hat{J}_{i^*}^c$ has up to $|\mathcal{C}_k|$ elements asymptotically, where
$k$ is the topic associated with word $i^*$. This is considered a
small constant O(1); (II) Note that $\hat{E} d_r = M\bar{X}'(\bar{X}^T d_r)$
and each column of $\bar{X}$ has at most $N \ll W$ nonzero entries.
Calculating the W × 1 projection value vector $v$ requires two
sparse matrix-vector multiplications and takes O(MN) time.
Finding the maximum requires W running time; (III) To
evaluate one set $\hat{J}_i \leftarrow \{j : \hat{E}_{i,i} + \hat{E}_{j,j} - 2\hat{E}_{i,j} \geq \zeta/2\}$ we
need to calculate $\hat{E}_{i,j}$, j = 1,..., W. This can be viewed
as projecting $\hat{E}$ along $d = e_i$ and takes O(MN). We also
note that the diagonal entries $\hat{E}_{w,w}$, w = 1,..., W, can be
calculated once using O(W) time. To sum up, these steps
take O(MNP + WP) running time.

We then consider the detecting and clustering steps (steps 14
to 21 in Alg. 2). We note that all the conditions in Step 17
have been calculated in the previous steps. Recalling that the
number of novel words is a small constant per topic, this
step requires a running time of $O(K^2)$.

Lastly, we consider the topic estimation steps in Algorithm 3.
Here all the corresponding inputs for the linear regression
have already been computed in the projection step. Each
linear regression has K variables and we upper bound its
running time by $O(K^3)$. Calculating the row-normalization
factors from $X\mathbf{1}$ requires O(MN) time. The row and column
re-normalization each requires at most O(WK) running time.
Overall, we need $O(WK^3 + MN)$ running time.

Other steps are also efficient. Splitting each document into
two independent halves takes linear time in N for each
document since we can achieve it using a random permutation
over N items. Generating each random direction $d_r$ requires
O(W) complexity if we use the spherical Gaussian prior.
While we can directly sort the empirical estimated solid angles
(in O(W log(W)) time), we only search for the words with the
largest solid angles, whose number is a constant w.r.t. W;
therefore it would take only O(W) time. □
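The complexity argument relies on never forming the W × W matrix explicitly: a projection is computed as two matrix-vector products. The sketch below illustrates this with dense NumPy arrays and made-up sizes; a practical implementation would presumably use sparse matrices, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(2)

W, M, N = 8, 40, 10                               # hypothetical sizes
p = np.full(W, 1.0 / W)
X = rng.multinomial(N, p, size=M).T.astype(float)        # W x M counts, first half
X_prime = rng.multinomial(N, p, size=M).T.astype(float)  # W x M counts, second half

# Row-normalize; the guard avoids division by zero for words that never occur.
X_bar = X / np.maximum(X.sum(axis=1, keepdims=True), 1.0)
Xp_bar = X_prime / np.maximum(X_prime.sum(axis=1, keepdims=True), 1.0)

d = rng.standard_normal(W)                        # one isotropic random direction

# Two matrix-vector products: first X_bar^T d (length M), then X_bar' times that.
v_fast = M * (Xp_bar @ (X_bar.T @ d))

# Reference computation that forms the full W x W matrix first.
E_hat = M * Xp_bar @ X_bar.T
v_slow = E_hat @ d

# Projecting along a standard basis vector e_i returns the i-th column of E_hat.
i = 3
e_i = np.zeros(W)
e_i[i] = 1.0
col_i = M * (Xp_bar @ (X_bar.T @ e_i))
print(np.abs(v_fast - v_slow).max())
```

The two-step product touches only the nonzero entries of the count matrices, which is where the O(MN) cost per projection comes from.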

K. Proof of Theorem 0] 

We focus on the case when the random projection directions 
are sampled from any isotropic distribution. Our proof is not 
tied to the special form of the distribution; just its isotropic 
nature. We first provide some useful propositions. We denote
by $\mathcal{C}_k$ the set of all novel words of topic $k$, for $k \in [K]$, and
denote by $\mathcal{C}_0$ the set of all non-novel words. We first show the following.

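Proposition 4. Let $E_i$ be the i-th row of $E$. Suppose $\beta$ is
separable and $\bar{R}$ is $\gamma_s$-simplicial, then the following is true:
For all $k \in [K]$,

                                     $\|E_i - E_j\|$            $E_{i,i} - 2E_{i,j} + E_{j,j}$
$i \in \mathcal{C}_k$, $j \in \mathcal{C}_k$        0                          0
$i \in \mathcal{C}_k$, $j \notin \mathcal{C}_k$     $\geq (1 - b)\gamma_s$     $\geq (1 - b)^2 \gamma_s^2 / \lambda_{\max}$

where $b = \max_{j \in \mathcal{C}_0, k} \bar{\beta}_{j,k}$ and $\lambda_{\max} > 0$ is the maximum
eigenvalue of $\bar{R}$.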

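Proof. We focus on the case k = 1 since the proofs for other
values of k are analogous. Let $\bar{\beta}_i$ be the i-th row vector of
matrix $\bar{\beta}$. To show the above results, recall that $E = \bar{\beta}\bar{R}\bar{\beta}^T$.
Then

$$\|E_i - E_j\| = \|(\bar{\beta}_i - \bar{\beta}_j)\bar{R}\bar{\beta}^T\|$$
$$E_{i,i} - 2E_{i,j} + E_{j,j} = (\bar{\beta}_i - \bar{\beta}_j)\bar{R}(\bar{\beta}_i - \bar{\beta}_j)^T.$$

It is clear that when $i, j \in \mathcal{C}_1$, i.e., they are both novel words
for the same topic, $\bar{\beta}_i = \bar{\beta}_j = e_1$. Hence, $\|E_i - E_j\| = 0$
and $E_{i,i} - 2E_{i,j} + E_{j,j} = 0$. When $i \in \mathcal{C}_1$, $j \notin \mathcal{C}_1$, we have
$\bar{\beta}_i = [1, 0, \ldots, 0]$, $\bar{\beta}_j = [\bar{\beta}_{j,1}, \bar{\beta}_{j,2}, \ldots, \bar{\beta}_{j,K}]$ with $\bar{\beta}_{j,1} < 1$.
Then,

$$\bar{\beta}_i - \bar{\beta}_j = [1 - \bar{\beta}_{j,1}, -\bar{\beta}_{j,2}, \ldots, -\bar{\beta}_{j,K}]
= (1 - \bar{\beta}_{j,1})[1, -c_2, \ldots, -c_K] := (1 - \bar{\beta}_{j,1})\, e^T$$

and $\sum_{l=2}^{K} c_l = 1$. Therefore, defining $Y := \bar{R}\bar{\beta}^T$, we get

$$\|E_i - E_j\|_2 = (1 - \bar{\beta}_{j,1}) \Big\|Y_1 - \sum_{l=2}^{K} c_l Y_l\Big\|_2$$

Noting that $Y$ is at least $\gamma_s$-simplicial, we have $\|E_i - E_j\|_2 \geq
(1 - b)\gamma_s$ where $b = \max_{j \in \mathcal{C}_0, k} \bar{\beta}_{j,k} < 1$.

Similarly, note that $\|e^T \bar{R}\| \geq \gamma_s$ and let $\bar{R} = U \Sigma U^T$ be
its singular value decomposition. If $\lambda_{\max}$ is the maximum
eigenvalue of $\bar{R}$, then we have

$$E_{i,i} - 2E_{i,j} + E_{j,j} = (1 - \bar{\beta}_{j,1})^2 (e^T \bar{R})\, U \Sigma^{-1} U^T (e^T \bar{R})^T \geq (1 - b)^2 \gamma_s^2 / \lambda_{\max}.$$

The inequality in the last step follows from the observation
that $e^T \bar{R}$ is within the column space spanned by $U$. □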

The results in Proposition [4] provide two statistics for 
identifying novel words of the same topic, $\|\mathbf{E}_i - \mathbf{E}_j\|$ and 
$E_{i,i} - 2E_{i,j} + E_{j,j}$. While the first is straightforward, the latter 
is efficient to calculate in practice with better computational 
complexity. Specifically, its empirical version, the set $J_i$ in 
Algorithm [2], 

$$
J_i = \{ j : \hat{E}_{i,i} - \hat{E}_{i,j} - \hat{E}_{j,i} + \hat{E}_{j,j} \geq d/2 \},
$$

can be used to discover the set of novel words of the same 
topics asymptotically. Formally, 

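Proposition 5. If $\|\hat{\mathbf{E}} - \mathbf{E}\|_\infty \leq (1 - b)^2\gamma_s^2/(8\lambda_{\max})$, then, 

1) For a novel word $i \in \mathcal{C}_k$, $J_i = \mathcal{C}_k^c$. 

2) For a non-novel word $j \in \mathcal{C}_0$, $J_j \supseteq \mathcal{C}_0^c$. 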
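
As a sanity check of the second statistic and of the sets $J_i$, the following Python sketch evaluates them on ideal (noise-free) co-occurrences. Everything numeric here is a hypothetical toy: the matrices `beta` and `R`, the threshold `d = 0.1`, and the helper name `far_word_sets` are assumptions made for illustration, not values or names taken from the algorithm.

```python
import numpy as np

def far_word_sets(E, d):
    """Return D[i, j] = E_ii - E_ij - E_ji + E_jj and J_i = {j : D[i, j] >= d / 2}."""
    diag = np.diag(E)
    D = diag[:, None] - E - E.T + diag[None, :]
    J = [set(np.flatnonzero(D[i] >= d / 2).tolist()) for i in range(E.shape[0])]
    return D, J

# Toy separable example: words 0 and 1 are novel for topic 0,
# word 2 is novel for topic 1, and word 3 is a non-novel word.
beta = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
R = np.array([[0.5, 0.1], [0.1, 0.4]])
E = beta @ R @ beta.T
D, J = far_word_sets(E, d=0.1)
```

In this toy case the statistic vanishes for the two novel words of topic 0, so neither appears in the other's set, while every word with a different topic profile is placed in $J_i$.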

Now we start to show that Algorithm [2] can detect all 
the novel words of the $K$ distinct topics consistently. As 
illustrated in Lemma [8], we detect the novel words by 
rank-ordering the solid angles $q_i$. We denote the minimum solid 
angle of the $K$ extreme points by $q_\wedge$. Our proof is to show 
that the estimated solid angle 

$$
\hat{p}_i = \frac{1}{P} \sum_{r=1}^{P} \mathbb{I}\{\forall j \in J_i,\ \hat{\mathbf{E}}_i \mathbf{d}_r \geq \hat{\mathbf{E}}_j \mathbf{d}_r\} \tag{17}
$$

converges to the ideal solid angle 

$$
q_i = \Pr\{\forall j \in S(i),\ (\mathbf{E}_i - \mathbf{E}_j)\mathbf{d} \geq 0\} \tag{18}
$$

as $M, P \to \infty$. Here $\mathbf{d}_1, \ldots, \mathbf{d}_P$ are iid directions drawn from an 
isotropic distribution. For a novel word $i \in \mathcal{C}_k$, $k = 1, \ldots, K$, 
let $S(i) = \mathcal{C}_k^c$, and for a non-novel word $i \in \mathcal{C}_0$, let $S(i) = \mathcal{C}_0^c$. 
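
The estimator in (17) is a plain Monte Carlo average, which the following Python sketch makes concrete. It is a minimal illustration under assumptions: directions are drawn from a standard Gaussian (one isotropic choice), the toy matrices and the sets `J` are hypothetical, and the function name `estimate_solid_angles` is not from the paper.

```python
import numpy as np

def estimate_solid_angles(E_hat, J, P, rng):
    """Fraction of directions d_r with E_i d_r >= E_j d_r for every j in J_i."""
    W = E_hat.shape[0]
    directions = rng.standard_normal((W, P))   # columns are iid isotropic directions
    proj = E_hat @ directions                   # proj[i, r] = E_i d_r
    p_hat = np.zeros(W)
    for i in range(W):
        others = np.array(sorted(J[i]), dtype=int)
        # An empty J_i makes the condition vacuously true for every direction.
        p_hat[i] = np.mean(np.all(proj[i] >= proj[others], axis=0))
    return p_hat

# Toy separable example: words 0, 1 novel for topic 0, word 2 novel for topic 1,
# word 3 non-novel (its row of E is the midpoint of rows 0 and 2).
beta = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
R = np.array([[0.5, 0.1], [0.1, 0.4]])
E = beta @ R @ beta.T
J = [{2, 3}, {2, 3}, {0, 1, 3}, {0, 1, 2}]
p_hat = estimate_solid_angles(E, J, P=2000, rng=np.random.default_rng(0))
```

With two topics each novel word wins on roughly half of the directions, whereas the non-novel word lies strictly between the two extreme rows and therefore never attains the maximum projection.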

To show the convergence of $\hat{p}_i$ to $q_i$, we consider an 
intermediate quantity, 

$$
p_i(\hat{\mathbf{E}}) = \Pr\{\forall j \in J_i,\ (\hat{\mathbf{E}}_i - \hat{\mathbf{E}}_j)\mathbf{d} \geq 0\}
$$

First, by Hoeffding's inequality, we have the following result. 

Proposition 6. $\forall t > 0$, $\forall i$, 

$$
\Pr\{|\hat{p}_i - p_i(\hat{\mathbf{E}})| \geq t\} \leq 2\exp(-2Pt^2) \tag{19}
$$
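
To give a feel for the bound in (19), the short Python sketch below inverts it to find how many projections make the right-hand side at most a target level. The function names and the example values of `t` and `delta` are hypothetical and chosen only for illustration.

```python
import math

def hoeffding_tail(P, t):
    """Right-hand side of (19): 2 exp(-2 P t^2)."""
    return 2.0 * math.exp(-2.0 * P * t * t)

def projections_needed(t, delta):
    """Smallest integer P with 2 exp(-2 P t^2) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * t * t))

# Example: deviation t = 0.05 with failure probability at most 0.01.
P_min = projections_needed(t=0.05, delta=0.01)
```

The required number of projections grows like $1/t^2$ and only logarithmically in $1/\delta$, and it does not depend on the vocabulary size $W$.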


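Next we show the convergence of $p_i(\hat{\mathbf{E}})$ to the solid angle $q_i$: 

Proposition 7. Consider the case when $\|\hat{\mathbf{E}} - \mathbf{E}\|_\infty \leq \frac{d}{8}$ and 
$\mathbf{R}$ is $\gamma_s$-simplicial. If $i$ is a novel word, then, 

$$
q_i - p_i(\hat{\mathbf{E}}) \leq \frac{W\sqrt{W}}{\pi d_2} \|\hat{\mathbf{E}} - \mathbf{E}\|_\infty
$$

Similarly, if $j$ is a non-novel word, we have, 

$$
p_j(\hat{\mathbf{E}}) - q_j \leq \frac{W\sqrt{W}}{\pi d_2} \|\hat{\mathbf{E}} - \mathbf{E}\|_\infty
$$

where $d_2 = (1 - b)\gamma_s$, $d = (1 - b)^2\gamma_s^2/\lambda_{\max}$. 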

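Proof. First note that, by the definition of $J_i$ and 
Proposition [4], if $\|\hat{\mathbf{E}} - \mathbf{E}\|_\infty \leq \frac{d}{8}$, then, for a novel word $i \in \mathcal{C}_k$, 
$J_i = S(i)$. And for a non-novel word $i \in \mathcal{C}_0$, $J_i \supseteq S(i)$. For 
convenience, let 

$$
\begin{aligned}
A_j &= \{\mathbf{d} : (\hat{\mathbf{E}}_i - \hat{\mathbf{E}}_j)\mathbf{d} \geq 0\}, & A &= \bigcap_{j \in J_i} A_j \\
B_j &= \{\mathbf{d} : (\mathbf{E}_i - \mathbf{E}_j)\mathbf{d} \geq 0\}, & B &= \bigcap_{j \in S(i)} B_j
\end{aligned}
$$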

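For $i$ being a novel word, we consider 

$$
q_i - p_i(\hat{\mathbf{E}}) = \Pr\{B\} - \Pr\{A\} \leq \Pr\{B \cap A^c\}
$$

Note that $J_i = S(i)$ when $\|\hat{\mathbf{E}} - \mathbf{E}\|_\infty \leq d/8$, 

$$
\begin{aligned}
\Pr\{B \cap A^c\} &= \Pr\Big\{B \cap \Big(\bigcup_{j \in S(i)} A_j^c\Big)\Big\} \\
&\leq \sum_{j \in S(i)} \Pr\Big\{\Big(\bigcap_{l \in S(i)} B_l\Big) \cap A_j^c\Big\} \leq \sum_{j \in S(i)} \Pr\{B_j \cap A_j^c\} \\
&= \sum_{j \in S(i)} \Pr\{(\hat{\mathbf{E}}_i - \hat{\mathbf{E}}_j)\mathbf{d} < 0 \text{ and } (\mathbf{E}_i - \mathbf{E}_j)\mathbf{d} \geq 0\} \\
&= \sum_{j \in S(i)} \frac{\phi_j}{2\pi}
\end{aligned}
$$

where $\phi_j$ is the angle between $\mathbf{e}_j = \mathbf{E}_i - \mathbf{E}_j$ and $\hat{\mathbf{e}}_j = \hat{\mathbf{E}}_i - \hat{\mathbf{E}}_j$ 
for any isotropic distribution on $\mathbf{d}$. Noting that $\phi \leq \tan(\phi)$, 

$$
\begin{aligned}
\Pr\{B \cap A^c\} &\leq \sum_{j \in S(i)} \frac{1}{2\pi} \frac{\|\hat{\mathbf{e}}_j - \mathbf{e}_j\|_2}{\|\mathbf{e}_j\|_2} \\
&\leq \frac{W\sqrt{W}}{\pi d_2} \|\hat{\mathbf{E}} - \mathbf{E}\|_\infty
\end{aligned}
$$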


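where the last inequality is obtained by the relationship 
between the $\ell_\infty$ norm and the $\ell_2$ norm, and the fact that for 
$j \in S(i)$, $\|\mathbf{e}_j\|_2 = \|\mathbf{E}_i - \mathbf{E}_j\|_2 \geq d_2 = (1 - b)\gamma_s$. Therefore 
for a novel word $i$, we have, 

$$
q_i - p_i(\hat{\mathbf{E}}) \leq \frac{W\sqrt{W}}{\pi d_2} \|\hat{\mathbf{E}} - \mathbf{E}\|_\infty
$$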

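Similarly for a non-novel word $i \in \mathcal{C}_0$, $J_i \supseteq S(i)$, 

$$
\begin{aligned}
p_i(\hat{\mathbf{E}}) - q_i &= \Pr\{A\} - \Pr\{B\} \leq \Pr\{A \cap B^c\} \\
&\leq \sum_{j \in S(i)} \Pr\Big\{\Big(\bigcap_{l \in S(i)} A_l\Big) \cap B_j^c\Big\} \\
&\leq \sum_{j \in S(i)} \Pr\{A_j \cap B_j^c\} \leq \frac{W\sqrt{W}}{\pi d_2} \|\hat{\mathbf{E}} - \mathbf{E}\|_\infty.
\end{aligned}
$$

□ 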

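A direct implication of Proposition [7] is the following. 

Proposition 8. $\forall \epsilon > 0$, let $\rho = \min\{\frac{d}{8}, \frac{\pi d_2 \epsilon}{W\sqrt{W}}\}$. If $\|\hat{\mathbf{E}} - \mathbf{E}\|_\infty \leq 
\rho$, then $q_i - p_i(\hat{\mathbf{E}}) \leq \epsilon$ for a novel word $i$ and $p_j(\hat{\mathbf{E}}) - q_j \leq \epsilon$ 
for a non-novel word $j$. 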

We now prove Theorem [3]. In order to correctly detect all 
the novel words of $K$ distinct topics, we decompose the error 
event to be the union of the following two types, 

1) Sorting error, i.e., $\exists i \in \bigcup_{k=1}^{K} \mathcal{C}_k$, $\exists j \in \mathcal{C}_0$ such that $\hat{p}_i < 
\hat{p}_j$. This event is denoted as $A_{i,j}$ and let $A = \bigcup A_{i,j}$. 

2) Clustering error, i.e., $\exists k$, $\exists i, j \in \mathcal{C}_k$ such that $i \in J_j$. 
This event is denoted as $B_{i,j}$ and let $B = \bigcup B_{i,j}$. 

We point out that the events $A$, $B$ are different from the 
notations we used in Proposition [7]. According to 
Proposition [8], we also define $\rho = \min\{\frac{d}{8}, \frac{\pi d_2 q_\wedge}{4W\sqrt{W}}\}$ and the event 
$C = \{\|\hat{\mathbf{E}} - \mathbf{E}\|_\infty \geq \rho\}$. We note that $B \subseteq C$. 

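Therefore, 

$$
\begin{aligned}
P_e = \Pr\{A \cup B\} &\leq \Pr\{A \cap C^c\} + \Pr\{C\} \\
&\leq \sum_{i \text{ novel},\, j \text{ non-novel}} \Pr\{A_{i,j} \cap C^c\} + \Pr\{C\} \\
&\leq \sum_{i,j} \Pr(\hat{p}_i - \hat{p}_j < 0 \ \cap\ \|\hat{\mathbf{E}} - \mathbf{E}\|_\infty < \rho) \\
&\quad + \Pr(\|\hat{\mathbf{E}} - \mathbf{E}\|_\infty \geq \rho)
\end{aligned}
$$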

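The second term can be bounded by Lemma [5]. Now we focus 
on the first term. Note that 

$$
\begin{aligned}
\hat{p}_i - \hat{p}_j &= \hat{p}_i - \hat{p}_j - p_i(\hat{\mathbf{E}}) + p_i(\hat{\mathbf{E}}) \\
&\quad - q_i + q_i - p_j(\hat{\mathbf{E}}) + p_j(\hat{\mathbf{E}}) - q_j + q_j \\
&= \{\hat{p}_i - p_i(\hat{\mathbf{E}})\} + \{p_i(\hat{\mathbf{E}}) - q_i\} \\
&\quad + \{p_j(\hat{\mathbf{E}}) - \hat{p}_j\} + \{q_j - p_j(\hat{\mathbf{E}})\} \\
&\quad + q_i - q_j
\end{aligned}
$$

and the fact that $q_i - q_j \geq q_\wedge$, then, 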

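$$
\begin{aligned}
&\Pr(\hat{p}_i < \hat{p}_j \ \cap\ \|\hat{\mathbf{E}} - \mathbf{E}\|_\infty < \rho) \\
&\leq \Pr(p_i(\hat{\mathbf{E}}) - \hat{p}_i \geq q_\wedge/4) + \Pr(\hat{p}_j - p_j(\hat{\mathbf{E}}) \geq q_\wedge/4) \\
&\quad + \Pr((q_i - p_i(\hat{\mathbf{E}}) \geq q_\wedge/4) \ \cap\ \|\hat{\mathbf{E}} - \mathbf{E}\|_\infty < \rho) \\
&\quad + \Pr((p_j(\hat{\mathbf{E}}) - q_j \geq q_\wedge/4) \ \cap\ \|\hat{\mathbf{E}} - \mathbf{E}\|_\infty < \rho) \\
&\leq 2\exp(-Pq_\wedge^2/8) \\
&\quad + \Pr((q_i - p_i(\hat{\mathbf{E}}) \geq q_\wedge/4) \ \cap\ \|\hat{\mathbf{E}} - \mathbf{E}\|_\infty < \rho) \\
&\quad + \Pr((p_j(\hat{\mathbf{E}}) - q_j \geq q_\wedge/4) \ \cap\ \|\hat{\mathbf{E}} - \mathbf{E}\|_\infty < \rho)
\end{aligned}
$$

The last inequality is by Proposition [6]. By Proposition [8], 
the last two terms are 0. Therefore, applying Lemma [5], we obtain, 

$$
P_e \leq 2W^2\exp(-Pq_\wedge^2/8) + 8W^2\exp(-\rho^2\eta^4 MN/20)
$$

And this concludes the proof of Theorem [3]. 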
L. Proof of Theorem @ 

Without loss of generality, let $1, \ldots, K$ be the novel words 
of topics 1 to $K$. We first consider the solution of the 
constrained linear regression. To simplify the notation, we denote 
by $\mathbf{E}_i = [E_{i,1}, \ldots, E_{i,K}]$ the first $K$ entries of a row vector, 
without the superscripts used in Algorithm [3]. 

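Proposition 9. Let $\mathbf{R}$ be $\gamma_a$-affine-independent. The solution 
to the following optimization problem 

$$
\hat{\mathbf{b}}^* = \arg\min_{b_j \geq 0,\ \sum_j b_j = 1} \Big\| \hat{\mathbf{E}}_i - \sum_{j=1}^{K} b_j \hat{\mathbf{E}}_j \Big\|
$$

converges to the $i$-th row of $\boldsymbol{\beta}$, $\boldsymbol{\beta}_i$, as $M \to \infty$. Moreover, 

$$
\Pr(\|\hat{\mathbf{b}}^* - \boldsymbol{\beta}_i\|_\infty \geq \epsilon) \leq 8W^2\exp\Big(-\frac{\epsilon^2 MN\gamma_a^2\eta^4}{320K}\Big)
$$

where $\eta$ is defined the same as in Lemma [5]. 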
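
The optimization problem in Proposition 9 is a least-squares fit over the probability simplex. The Python sketch below solves it with projected gradient descent, which is a stand-in technique chosen here for brevity and not necessarily the solver used by the algorithm; the toy matrices and the names `project_to_simplex` and `simplex_regression` are likewise assumptions made for illustration.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {b : b_j >= 0, sum_j b_j = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def simplex_regression(target, Y, iters=500):
    """Minimize ||target - b Y||^2 over the simplex by projected gradient descent."""
    b = np.full(Y.shape[0], 1.0 / Y.shape[0])
    step = 1.0 / (2.0 * np.linalg.norm(Y @ Y.T, 2))   # 1 / Lipschitz constant
    for _ in range(iters):
        grad = 2.0 * (b @ Y - target) @ Y.T
        b = project_to_simplex(b - step * grad)
    return b

# Toy example with ideal statistics: words 0 and 1 are the novel words.
K = 2
beta = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]])
R = np.array([[0.5, 0.1], [0.1, 0.4]])
E = beta @ R @ beta.T
Y = E[:K, :K]                      # rows of the novel words, first K entries
B = np.vstack([simplex_regression(E[i, :K], Y) for i in range(E.shape[0])])
```

With noise-free statistics each recovered row of `B` matches the corresponding row of `beta`, which mirrors the statement that the ideal problem has optimal value zero at the true row.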

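Proof. We note that $\boldsymbol{\beta}_i$ is the optimal solution to the following 
problem with ideal word co-occurrence statistics 

$$
\mathbf{b}^* = \arg\min_{b_j \geq 0,\ \sum_j b_j = 1} \Big\| \mathbf{E}_i - \sum_{j=1}^{K} b_j \mathbf{E}_j \Big\|
$$

Define $f(\mathbf{E}, \mathbf{b}) = \|\mathbf{E}_i - \sum_{j=1}^{K} b_j \mathbf{E}_j\|$ and note the fact that 
$f(\mathbf{E}, \mathbf{b}^*) = 0$. Let $\mathbf{Y} = [\mathbf{E}_1^\top, \ldots, \mathbf{E}_K^\top]^\top$. Then, 

$$
\begin{aligned}
f(\mathbf{E}, \mathbf{b}) - f(\mathbf{E}, \mathbf{b}^*) &= \Big\| \mathbf{E}_i - \sum_{j=1}^{K} b_j \mathbf{E}_j \Big\| - 0 \\
&= \Big\| \sum_{j=1}^{K} (b_j - b_j^*) \mathbf{E}_j \Big\| = \sqrt{(\mathbf{b} - \mathbf{b}^*)\mathbf{Y}\mathbf{Y}^\top(\mathbf{b} - \mathbf{b}^*)^\top} \\
&\geq \|\mathbf{b} - \mathbf{b}^*\| \gamma_a
\end{aligned}
$$

The last inequality is true by the definition of 
affine-independence. Next, note that, 

$$
\begin{aligned}
|f(\mathbf{E}, \mathbf{b}) - f(\hat{\mathbf{E}}, \mathbf{b})| &\leq \Big\| \mathbf{E}_i - \hat{\mathbf{E}}_i + \sum_j b_j (\hat{\mathbf{E}}_j - \mathbf{E}_j) \Big\| \\
&\leq \|\mathbf{E}_i - \hat{\mathbf{E}}_i\| + \sum_j b_j \|\hat{\mathbf{E}}_j - \mathbf{E}_j\| \\
&\leq 2 \max_w \|\hat{\mathbf{E}}_w - \mathbf{E}_w\|
\end{aligned}
$$

Combining the above inequalities, we obtain, 

$$
\begin{aligned}
\|\hat{\mathbf{b}}^* - \mathbf{b}^*\| &\leq \frac{1}{\gamma_a}\{f(\mathbf{E}, \hat{\mathbf{b}}^*) - f(\mathbf{E}, \mathbf{b}^*)\} \\
&= \frac{1}{\gamma_a}\{f(\mathbf{E}, \hat{\mathbf{b}}^*) - f(\hat{\mathbf{E}}, \hat{\mathbf{b}}^*) + f(\hat{\mathbf{E}}, \hat{\mathbf{b}}^*) \\
&\qquad - f(\hat{\mathbf{E}}, \mathbf{b}^*) + f(\hat{\mathbf{E}}, \mathbf{b}^*) - f(\mathbf{E}, \mathbf{b}^*)\} \\
&\leq \frac{1}{\gamma_a}\{f(\mathbf{E}, \hat{\mathbf{b}}^*) - f(\hat{\mathbf{E}}, \hat{\mathbf{b}}^*) + f(\hat{\mathbf{E}}, \mathbf{b}^*) - f(\mathbf{E}, \mathbf{b}^*)\} \\
&\leq \frac{4\sqrt{K}}{\gamma_a} \|\hat{\mathbf{E}} - \mathbf{E}\|_\infty
\end{aligned}
$$

where the third step uses that $\hat{\mathbf{b}}^*$ minimizes $f(\hat{\mathbf{E}}, \cdot)$, and 
the last term converges to 0 almost surely. The 
convergence rate follows directly from Lemma [5]. □ 

We next consider the row renormalization. Let $\hat{\mathbf{b}}^*(i)$ be 
the optimal solution in Proposition [9] for the $i$-th word, and 
consider 

$$
\hat{\mathbf{B}}_i := \hat{\mathbf{b}}^*(i)^\top \Big(\frac{1}{MN}\mathbf{X}_i\mathbf{1}_{M \times 1}\Big) \to \boldsymbol{\beta}_i \,\mathrm{diag}(\mathbf{a}) \tag{20}
$$

To show the convergence rate of the above equation, it is 
straightforward to apply the result in Lemma [5]. 

Proposition 10. For the row-scaled estimation $\hat{\mathbf{B}}_i$ as in 
Eq. (20), we have, 

$$
\Pr(|\hat{B}_{i,k} - \beta_{i,k}a_k| \geq \epsilon) \leq 8W^2\exp\Big(-\frac{\epsilon^2 MN\gamma_a^2\eta^4}{1280K}\Big)
$$

Proof. By Proposition [9] we have, 

$$
\Pr(|\hat{b}^*(i)_k - \beta_{i,k}| \geq \epsilon/2) \leq 8W^2\exp\Big(-\frac{\epsilon^2 MN\gamma_a^2\eta^4}{1280K}\Big)
$$

Recall that, by McDiarmid's inequality, we have 

$$
\Pr\Big(\Big|\frac{1}{MN}\mathbf{X}_i\mathbf{1}_{M \times 1} - \boldsymbol{\beta}_i\mathbf{a}\Big| \geq \epsilon/2\Big) \leq \exp(-\epsilon^2 MN/2)
$$

Therefore, 

$$
\Pr(|\hat{B}_{i,k} - \beta_{i,k}a_k| \geq \epsilon) \leq 8W^2\exp\Big(-\frac{\epsilon^2 MN\gamma_a^2\eta^4}{1280K}\Big) + \exp(-\epsilon^2 MN/2)
$$

where the second term is dominated by the first term. □ 

Finally, we consider the column normalization step to 
remove the effect of $\mathrm{diag}(\mathbf{a})$: 

$$
\hat{\beta}_{i,k} := \hat{B}_{i,k} \Big/ \sum_{w=1}^{W} \hat{B}_{w,k} \tag{21}
$$

And $\sum_{w=1}^{W} \hat{B}_{w,k} \to a_k$ for $k = 1, \ldots, K$. A worst case 
analysis on its convergence is, 

$$
\begin{aligned}
\Pr\Big(\Big|\sum_{w=1}^{W} \hat{B}_{w,k} - a_k\Big| > \epsilon\Big) &\leq W \Pr(|\hat{B}_{i,k} - \beta_{i,k}a_k| > \epsilon/W) \\
&\leq 8W^3\exp\Big(-\frac{\epsilon^2 MN\gamma_a^2\eta^4}{1280W^2K}\Big)
\end{aligned}
$$

Combining all the results above, we can show a corresponding 
bound on the estimation error of $\hat{\beta}_{i,k}$ for every $i = 1, \ldots, W$ 
and $k = 1, \ldots, K$, in which $a_{\min} > 0$ denotes the minimum value 
of the entries of $\mathbf{a}$. This concludes the proof of the theorem. 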
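
The two renormalization steps can be traced on ideal quantities with the Python sketch below. It rests on explicit assumptions: the regression output `b_star` is taken to be the row-normalized topic matrix (the reading suggested by the $\mathrm{diag}(\mathbf{a})$ factor in Eq. (20)), the row factors are replaced by their ideal values $\boldsymbol{\beta}\mathbf{a}$, and all numbers and the helper name `renormalize` are hypothetical.

```python
import numpy as np

def renormalize(b_star, row_factors):
    """Row re-scaling as in Eq. (20), then column normalization as in Eq. (21)."""
    B = b_star * row_factors[:, None]          # B_i = b*(i) times the i-th row factor
    return B / B.sum(axis=0, keepdims=True)    # divide each column by its sum

# Toy column-stochastic topic matrix and topic prior mean.
beta = np.array([[0.6, 0.0], [0.0, 0.5], [0.4, 0.5]])
a = np.array([0.3, 0.7])

row_factors = beta @ a                          # ideal value of (1/MN) X 1
b_star = (beta * a) / row_factors[:, None]      # assumed limit of the regression step
beta_rec = renormalize(b_star, row_factors)
```

In this idealized setting the row re-scaling yields $\boldsymbol{\beta}\,\mathrm{diag}(\mathbf{a})$, whose column sums equal $\mathbf{a}$, so the column normalization returns the original `beta`.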

M. Proof of Lemma [9] 

Proof. We first show that irreducibility implies separability, or 
equivalently, if the collection is not separable, then it is not 
irreducible. Suppose that $\{\nu_1, \ldots, \nu_K\}$ is not separable. Then 
there exists some $k \in [K]$ and a $\delta > 0$ such that, 

$$
\inf_{A:\ \nu_k(A) > 0}\ \max_{j:\ j \neq k} \frac{\nu_j(A)}{\nu_k(A)} = \delta > 0.
$$

Then $\forall A \in \mathcal{F} : \nu_k(A) > 0$, $\max_{j:\ j \neq k} \frac{\nu_j(A)}{\nu_k(A)} \geq \delta$. This implies 
that $\forall A \in \mathcal{F} : \nu_k(A) > 0$, 

$$
\sum_{j:\ j \neq k} \nu_j(A) - \delta\nu_k(A) \geq 0.
$$

On the other hand, $\forall A \in \mathcal{F} : \nu_k(A) = 0$, we have 

$$
\sum_{j:\ j \neq k} \nu_j(A) - \delta\nu_k(A) = \sum_{j:\ j \neq k} \nu_j(A) \geq 0.
$$

Thus the linear combination $\sum_{j \neq k} \nu_j - \delta\nu_k$ with one strictly 
negative coefficient $-\delta$ is nonnegative over all measurable $A$. 
This implies that the collection of measures $\{\nu_1, \ldots, \nu_K\}$ is 
not irreducible. 

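We next show that separability implies irreducibility. If the 
collection of measures $\{\nu_1, \ldots, \nu_K\}$ is separable, then by the 
definition of separability, $\forall k$, $\exists A_n^{(k)} \in \mathcal{F}$, $n = 1, 2, \ldots$, such 
that $\nu_k(A_n^{(k)}) > 0$ and $\forall j \neq k$, $\frac{\nu_j(A_n^{(k)})}{\nu_k(A_n^{(k)})} \to 0$ as $n \to \infty$. Now 
consider any linear combination of measures $\sum_{i=1}^{K} c_i\nu_i$ which 
is nonnegative over all measurable sets, i.e., for all $A \in \mathcal{F}$, 
$\sum_{i=1}^{K} c_i\nu_i(A) \geq 0$. Then $\forall k = 1, \ldots, K$ and all $n \geq 1$ we 
have, 

$$
\begin{aligned}
\sum_{i=1}^{K} c_i\nu_i(A_n^{(k)}) \geq 0 
&\ \Rightarrow\ \nu_k(A_n^{(k)}) \Big( c_k + \sum_{j \neq k} c_j \frac{\nu_j(A_n^{(k)})}{\nu_k(A_n^{(k)})} \Big) \geq 0 \\
&\ \Rightarrow\ c_k \geq -\sum_{j \neq k} c_j \frac{\nu_j(A_n^{(k)})}{\nu_k(A_n^{(k)})} \to 0 \text{ as } n \to \infty.
\end{aligned}
$$

Therefore, $c_k \geq 0$ for all $k$ and the collection of measures is 
irreducible. □ 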
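
For a finite vocabulary the first half of the argument can be checked numerically, as in the Python sketch below. It assumes the special case in which each measure is a column of a nonnegative matrix (so nonnegativity on all sets reduces to nonnegativity on single words); the matrices and the helper names `separability_gap` and `negative_certificate` are hypothetical.

```python
import numpy as np

def separability_gap(beta, k):
    """delta_k: min over words w with beta[w, k] > 0 of max_{j != k} beta[w, j] / beta[w, k]."""
    support = beta[:, k] > 0
    others = np.delete(beta[support], k, axis=1)
    return float(np.min(others.max(axis=1) / beta[support, k]))

def negative_certificate(beta, k):
    """Coefficients from the proof: -delta_k on measure k and 1 on every other measure."""
    c = np.ones(beta.shape[1])
    c[k] = -separability_gap(beta, k)
    return c

# Separable example: word 0 is novel for topic 0 and word 1 for topic 1.
separable = np.array([[0.5, 0.0], [0.0, 0.4], [0.5, 0.6]])
# Non-separable example: every word has positive mass under both topics.
mixed = np.array([[0.5, 0.1], [0.2, 0.4], [0.3, 0.5]])

c = negative_certificate(mixed, k=0)
combo = mixed @ c          # value of the linear combination on each single word
```

For `mixed` the gap is strictly positive, so `c` has a strictly negative entry while `combo` stays nonnegative on every word, which is exactly the failure of irreducibility constructed in the proof; for `separable` the gap is zero for both topics.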